[00:03:44] PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:14] PROBLEM - Check systemd state on an-airflow1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:23:14] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10odimitrijevic) Approved. [00:27:34] 10SRE-Access-Requests: Additional approvers for analytics-privatedata-users - https://phabricator.wikimedia.org/T356132 (10odimitrijevic) [00:38:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/993475 [00:39:09] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/993475 (owner: 10TrainBranchBot) [00:46:20] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:12] (03CR) 10Andrea Denisse: "I added instructions on process to update UID/GID on our current hosts in here: https://phabricator.wikimedia.org/T352665#9496794" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) 
[00:51:41] (03PS1) 10Cwhite: logging::collector: add mw accesslog sampling by benthos [puppet] - 10https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) [00:51:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:59:26] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:50] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T356138 (10phaultfinder) [01:04:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/993475 (owner: 10TrainBranchBot) [01:06:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:13:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:23] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) @xcollazo So after checking with Josh from ITS, the mail should BOTH arrive in your personal inbox and also show up in the groups dashboard. The part th... 
[01:16:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:22:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:22:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:22:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:33:03] (03PS1) 10Dzahn: phabricator: fix team name in smtp monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/993822 [01:33:39] (03CR) 10Dzahn: [C: 03+2] "compare line 21" [puppet] - 10https://gerrit.wikimedia.org/r/993822 (owner: 10Dzahn) [01:38:56] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:49:08] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:21] (ProbeDown) firing: (6) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:49:43] (03PS1) 10TTO: Enable PageNotice extension in production 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/993824 (https://phabricator.wikimedia.org/T61245) [01:54:02] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:56:02] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:58:52] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:26:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:31:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:39:24] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:50] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:46:36] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:51:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Ahoelzl) @Eevans @ABran-WMF can you help us move this over the finishing line? It's blocking Aleksandar from getting productive in the data engineerin... [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T0300) [03:07:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.16 [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/993477 (https://phabricator.wikimedia.org/T354434) [03:07:22] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.16 [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/993477 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [03:09:25] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:27:52] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.16 [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/993477 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [03:29:50] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [03:30:01] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T0400) [04:02:12] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.13 (duration: 02m 09s) [04:03:30] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993828 (https://phabricator.wikimedia.org/T354434) [04:03:32] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993828 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [04:04:18] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993828 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [04:04:43] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.16 refs T354434 [04:04:49] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [04:20:23] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T356146 (10phaultfinder) [04:57:22] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.16 refs T354434 (duration: 52m 38s) [04:57:27] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [05:19:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355739 [05:19:26] T355739: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T355739 [05:19:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355739 [05:19:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2129 with weight 0 T355739', diff saved to https://phabricator.wikimedia.org/P55844 and previous config 
saved to /var/cache/conftool/dbconfig/20240130-051952-marostegui.json [05:20:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/992442 (https://phabricator.wikimedia.org/T355739) (owner: 10Gerrit maintenance bot) [05:32:53] (03PS1) 10Marostegui: db1224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/993833 (https://phabricator.wikimedia.org/T354591) [05:38:57] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [05:40:05] !log Starting s6 codfw failover from db2114 to db2129 - T355739 [05:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:11] T355739: Switchover s6 master (db2114 -> db2129) - https://phabricator.wikimedia.org/T355739 [05:40:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s6 codfw as read-only for maintenance - T355739', diff saved to https://phabricator.wikimedia.org/P55845 and previous config saved to /var/cache/conftool/dbconfig/20240130-054025-root.json [05:40:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary and set section read-write T355739', diff saved to https://phabricator.wikimedia.org/P55846 and previous config saved to /var/cache/conftool/dbconfig/20240130-054053-root.json [05:40:56] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[05:41:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2114 T355739', diff saved to https://phabricator.wikimedia.org/P55847 and previous config saved to /var/cache/conftool/dbconfig/20240130-054154-root.json [05:42:40] (03PS2) 10Marostegui: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/992443 (https://phabricator.wikimedia.org/T355739) (owner: 10Gerrit maintenance bot) [05:42:52] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/992443 (https://phabricator.wikimedia.org/T355739) (owner: 10Gerrit maintenance bot) [05:42:55] (03CR) 10Marostegui: [V: 03+2 C: 03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/992443 (https://phabricator.wikimedia.org/T355739) (owner: 10Gerrit maintenance bot) [05:43:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 36 hosts with reason: Primary switchover s1 T356059 [05:43:50] T356059: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T356059 [05:44:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2112 with weight 0 T356059', diff saved to https://phabricator.wikimedia.org/P55848 and previous config saved to /var/cache/conftool/dbconfig/20240130-054410-marostegui.json [05:44:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 36 hosts with reason: Primary switchover s1 T356059 [05:45:06] (03PS2) 10Marostegui: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/993470 (https://phabricator.wikimedia.org/T356059) (owner: 10Gerrit maintenance bot) [05:45:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2112 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/993469 (https://phabricator.wikimedia.org/T356059) (owner: 10Gerrit maintenance bot) [05:50:39] (ProbeDown) firing: (6) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2146', diff saved to https://phabricator.wikimedia.org/P55849 and previous config saved to /var/cache/conftool/dbconfig/20240130-060727-root.json [06:10:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55850 and previous config saved to /var/cache/conftool/dbconfig/20240130-061014-root.json [06:11:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:12:24] !log Starting s1 codfw failover from db2103 to db2112 - T356059 [06:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:36] T356059: Switchover s1 master (db2103 -> db2112) - https://phabricator.wikimedia.org/T356059 [06:12:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s1 codfw as read-only for maintenance - T356059', diff saved to https://phabricator.wikimedia.org/P55851 and previous config saved to /var/cache/conftool/dbconfig/20240130-061243-marostegui.json [06:13:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2112 to s1 primary and set section read-write T356059', diff saved to https://phabricator.wikimedia.org/P55852 and previous config saved to /var/cache/conftool/dbconfig/20240130-061305-marostegui.json [06:13:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/993470 (https://phabricator.wikimedia.org/T356059) (owner: 10Gerrit maintenance bot) 
[06:14:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2146', diff saved to https://phabricator.wikimedia.org/P55853 and previous config saved to /var/cache/conftool/dbconfig/20240130-061423-root.json [06:15:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2103 T356059', diff saved to https://phabricator.wikimedia.org/P55854 and previous config saved to /var/cache/conftool/dbconfig/20240130-061529-root.json [06:15:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55855 and previous config saved to /var/cache/conftool/dbconfig/20240130-061552-root.json [06:16:33] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) [06:18:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356064 [06:18:26] T356064: Switchover es4 codfw master es2020 -> es2021 - https://phabricator.wikimedia.org/T356064 [06:18:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356064 [06:19:21] (03CR) 10Marostegui: [C: 03+2] db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993711 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [06:20:08] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993711 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [06:20:42] (03PS1) 10Marostegui: es4: Promote es2021 to master [puppet] - 10https://gerrit.wikimedia.org/r/993835 (https://phabricator.wikimedia.org/T356064) [06:21:18] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:993711|db-production.php: Disable writes on es4 
(T356064)]] [06:21:57] (03PS1) 10Marostegui: wmnet: Update es4 master [dns] - 10https://gerrit.wikimedia.org/r/993836 (https://phabricator.wikimedia.org/T356064) [06:22:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356064 [06:22:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356064 [06:22:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2020 with weight 0 T356064', diff saved to https://phabricator.wikimedia.org/P55856 and previous config saved to /var/cache/conftool/dbconfig/20240130-062241-marostegui.json [06:22:55] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:993711|db-production.php: Disable writes on es4 (T356064)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:23:33] (03CR) 10Marostegui: [C: 03+2] es4: Promote es2021 to master [puppet] - 10https://gerrit.wikimedia.org/r/993835 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [06:23:55] !log marostegui@deploy2002 marostegui: Continuing with sync [06:25:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:26:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:27:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55857 and previous config saved to /var/cache/conftool/dbconfig/20240130-062714-root.json [06:27:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51307 bytes in 1.769 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:28:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 
200 OK - 8572 bytes in 7.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:29:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1224 T354591', diff saved to https://phabricator.wikimedia.org/P55858 and previous config saved to /var/cache/conftool/dbconfig/20240130-062930-root.json [06:29:36] T354591: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 [06:29:37] (03CR) 10Marostegui: [C: 03+2] db1224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/993833 (https://phabricator.wikimedia.org/T354591) (owner: 10Marostegui) [06:30:29] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:993711|db-production.php: Disable writes on es4 (T356064)]] (duration: 09m 11s) [06:30:34] T356064: Switchover es4 codfw master es2020 -> es2021 - https://phabricator.wikimedia.org/T356064 [06:30:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P55859 and previous config saved to /var/cache/conftool/dbconfig/20240130-063057-root.json [06:31:43] (03PS1) 10Marostegui: Revert "es4: Promote es2021 to master" [puppet] - 10https://gerrit.wikimedia.org/r/993773 [06:35:57] !log Starting es4 codfw failover from es2020 to es2021 - T356064 [06:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:03] T356064: Switchover es4 codfw master es2020 -> es2021 - https://phabricator.wikimedia.org/T356064 [06:36:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2021 to es4 primary T356064', diff saved to https://phabricator.wikimedia.org/P55860 and previous config saved to /var/cache/conftool/dbconfig/20240130-063625-root.json [06:36:58] (03CR) 10Marostegui: [C: 03+2] wmnet: Update es4 master [dns] - 10https://gerrit.wikimedia.org/r/993836 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [06:41:45] (SwiftTooManyMediaUploads) resolved: Too many 
eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:42:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P55861 and previous config saved to /var/cache/conftool/dbconfig/20240130-064219-root.json [06:45:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Reduce es2021 weight T356064', diff saved to https://phabricator.wikimedia.org/P55862 and previous config saved to /var/cache/conftool/dbconfig/20240130-064512-root.json [06:45:18] T356064: Switchover es4 codfw master es2020 -> es2021 - https://phabricator.wikimedia.org/T356064 [06:45:49] (03CR) 10Marostegui: [C: 03+2] Revert "es4: Promote es2021 to master" [puppet] - 10https://gerrit.wikimedia.org/r/993773 (owner: 10Marostegui) [06:46:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P55864 and previous config saved to /var/cache/conftool/dbconfig/20240130-064602-root.json [06:47:38] (03PS1) 10Marostegui: es2020: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/993838 (https://phabricator.wikimedia.org/T356064) [06:48:56] !log marostegui@deploy2002 backport Cancelled [06:49:13] (03PS1) 10Marostegui: Revert "Revert "es4: Promote es2021 to master"" [puppet] - 10https://gerrit.wikimedia.org/r/993774 [06:49:27] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993775 [06:49:54] (03CR) 10Marostegui: [C: 03+2] es2020: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/993838 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [06:50:39] (03CR) 10Marostegui: [C: 03+2] Revert 
"Revert "es4: Promote es2021 to master"" [puppet] - 10https://gerrit.wikimedia.org/r/993774 (owner: 10Marostegui) [06:51:19] (03CR) 10Marostegui: [C: 03+2] Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993775 (owner: 10Marostegui) [06:51:28] (03PS2) 10Marostegui: es2020: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/993838 (https://phabricator.wikimedia.org/T356064) [06:52:01] (03CR) 10Marostegui: [V: 03+2 C: 03+2] es2020: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/993838 (https://phabricator.wikimedia.org/T356064) (owner: 10Marostegui) [06:53:35] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993775 (owner: 10Marostegui) [06:54:17] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:993775|Revert "db-production.php: Disable writes on es4"]] [06:55:43] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:993775|Revert "db-production.php: Disable writes on es4"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:55:46] !log marostegui@deploy2002 marostegui: Continuing with sync [06:57:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P55865 and previous config saved to /var/cache/conftool/dbconfig/20240130-065724-root.json [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T0700) [07:00:05] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T0700). 
[07:00:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T356060 [07:00:31] T356060: Switchover x2 codfw master db2142 -> db2144 - https://phabricator.wikimedia.org/T356060 [07:00:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T356060 [07:01:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P55866 and previous config saved to /var/cache/conftool/dbconfig/20240130-070107-root.json [07:02:05] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:993775|Revert "db-production.php: Disable writes on es4"]] (duration: 07m 48s) [07:06:07] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) [07:07:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 1%: After switchover', diff saved to https://phabricator.wikimedia.org/P55867 and previous config saved to /var/cache/conftool/dbconfig/20240130-070757-root.json [07:12:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2144 to x2 master T356060', diff saved to https://phabricator.wikimedia.org/P55868 and previous config saved to /var/cache/conftool/dbconfig/20240130-071202-root.json [07:12:14] T356060: Switchover x2 codfw master db2142 -> db2144 - https://phabricator.wikimedia.org/T356060 [07:12:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P55869 and previous config saved to /var/cache/conftool/dbconfig/20240130-071229-root.json [07:16:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2103 (re)pooling @ 100%: After switchover', diff saved to 
https://phabricator.wikimedia.org/P55870 and previous config saved to /var/cache/conftool/dbconfig/20240130-071612-root.json [07:23:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 5%: After switchover', diff saved to https://phabricator.wikimedia.org/P55871 and previous config saved to /var/cache/conftool/dbconfig/20240130-072302-root.json [07:27:31] (03PS1) 10Marostegui: db2144: Promote to master [puppet] - 10https://gerrit.wikimedia.org/r/994011 (https://phabricator.wikimedia.org/T356060) [07:27:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P55872 and previous config saved to /var/cache/conftool/dbconfig/20240130-072734-root.json [07:29:02] (03CR) 10Marostegui: [C: 03+2] db2144: Promote to master [puppet] - 10https://gerrit.wikimedia.org/r/994011 (https://phabricator.wikimedia.org/T356060) (owner: 10Marostegui) [07:30:10] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) [07:32:47] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T356069 [07:32:52] T356069: Switchover s3 master (db2105 -> db2127) - https://phabricator.wikimedia.org/T356069 [07:32:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2127 with weight 0 T356069', diff saved to https://phabricator.wikimedia.org/P55873 and previous config saved to /var/cache/conftool/dbconfig/20240130-073257-marostegui.json [07:33:18] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T356069 [07:36:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2127 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/993472 (https://phabricator.wikimedia.org/T356069) (owner: 10Gerrit maintenance bot) 
[07:38:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55874 and previous config saved to /var/cache/conftool/dbconfig/20240130-073807-root.json [07:45:28] PROBLEM - BFD status on lsw1-b7-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:45:46] (03PS2) 10Muehlenhoff: airflow: Keep Python 2 due to Hive [puppet] - 10https://gerrit.wikimedia.org/r/993727 (https://phabricator.wikimedia.org/T335261) [07:46:20] !log Starting s3 codfw failover from db2105 to db2127 - T356069 [07:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:25] T356069: Switchover s3 master (db2105 -> db2127) - https://phabricator.wikimedia.org/T356069 [07:46:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 codfw as read-only for maintenance - T356069', diff saved to https://phabricator.wikimedia.org/P55875 and previous config saved to /var/cache/conftool/dbconfig/20240130-074634-marostegui.json [07:46:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2127 to s3 primary and set section read-write T356069', diff saved to https://phabricator.wikimedia.org/P55876 and previous config saved to /var/cache/conftool/dbconfig/20240130-074656-marostegui.json [07:47:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2105 T356069', diff saved to https://phabricator.wikimedia.org/P55877 and previous config saved to /var/cache/conftool/dbconfig/20240130-074746-root.json [07:47:56] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/993473 (https://phabricator.wikimedia.org/T356069) (owner: 10Gerrit maintenance bot) [07:48:00] (03PS2) 10Marostegui: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/993473 (https://phabricator.wikimedia.org/T356069) (owner: 10Gerrit maintenance bot) [07:48:46] (03CR) 10Marostegui: [V: 
03+2 C: 03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/993473 (https://phabricator.wikimedia.org/T356069) (owner: 10Gerrit maintenance bot) [07:50:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: After switchover', diff saved to https://phabricator.wikimedia.org/P55878 and previous config saved to /var/cache/conftool/dbconfig/20240130-075035-root.json [07:52:33] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10Marostegui) db2105 is no longer a master. This host can be done after being depooled [07:52:43] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10Marostegui) [07:53:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P55879 and previous config saved to /var/cache/conftool/dbconfig/20240130-075314-root.json [07:53:15] !log ayounsi@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2034.codfw.wmnet to cluster codfw02 and group AB [07:53:31] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Marostegui) >>! In T355862#9494195, @Marostegui wrote: > db2142 - x2 master > db2103 - s1 master > es2020 - es4 master The... [07:55:07] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2034.codfw.wmnet to cluster codfw02 and group AB [07:57:27] Hmm. Reply Tool is giving me a "Unable to fetch Parsoid HTML" on both enWS and enWP right now. [07:58:05] I don't think enWS gets a new deploy until later today, and enWP tomorrow? 
[07:58:29] If so, is there a service with a blinking red status light somewhere? [07:59:23] Can confirm the same issue on enwiktionary [07:59:26] Can't submit a reply [07:59:51] Normal editing works fine. Haven't tested Visual Editor. [08:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T0800) [08:00:05] tto: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:22] * tto waves [08:00:49] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Midleading) Is it possible to keep prerendered thumbnails indefinitely? I don't want to see search results full o... [08:02:21] Enabling an extension needs a lot of checks and such. e.g. the security review [08:02:32] Such checks have taken place years ago, Amir1 [08:02:42] The task has been open for 10 years+ [08:02:57] Hmm. Visual Editor seems to work fine, so it's specifically something Reply Tool uses. Maybe via the Action API? [08:03:05] All the necessary prerequisites have been fulfilled, including a test period on beta [08:03:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10ABran-WMF) on it [08:03:16] let me check [08:04:17] Amir1, relevant task is https://phabricator.wikimedia.org/T229718 [08:04:33] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) >>! In T211661#9497396, @Midleading wrote: > Is it possible to keep prerendered thumbnails indefinitel... 
[08:04:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (10Marostegui) @Jclark-ctr this host is now off, you can proceed whenever you want. [08:05:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: After switchover', diff saved to https://phabricator.wikimedia.org/P55880 and previous config saved to /var/cache/conftool/dbconfig/20240130-080540-root.json [08:05:47] thanks. Let me read a bit on this [08:06:15] Let me know if you have any Qs [08:06:26] deploying new extensions is a bit of a complicated task (it's just flipping a switch but you need to make sure a lot of stuff is in place before doing so) [08:06:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 10%: Switchover', diff saved to https://phabricator.wikimedia.org/P55881 and previous config saved to /var/cache/conftool/dbconfig/20240130-080644-root.json [08:06:47] The extension is already "deployed" per se, as I understand the use of the term [08:07:01] It's enabled on enwiktionary in beta cluster [08:08:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P55882 and previous config saved to /var/cache/conftool/dbconfig/20240130-080819-root.json [08:12:24] tto: I have a couple of suggestions: 1- first enable it in one wiki only, I'd say testwiki, then enwiktionary, then the rest of the wikis, this is to make sure everything will work just fine. 2- It adds a couple of messages that are loaded on every page view, that's generally not an issue but I'll need to double check the caching for messages, otherwise it's going to bring down everything 3- Please add tests to the extension [08:12:44] For now, let's enable it in testwiki, wanna change the patch to reflect that? 
[08:13:02] Yes, no worries, I will do that [08:13:17] PROBLEM - MariaDB read only x2 #page on db2144 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.6.16-MariaDB-log, Uptime 3385s, event_scheduler: True, 2935.22 QPS, connection latency: 0.003999s, query latency: 0.000440s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:13:23] I will fix that [08:13:42] Thanks :-) [08:13:51] leftover from x2 switch, sorry about it [08:13:54] fixed now [08:14:01] thanks [08:14:18] <_joe_> thanks [08:14:29] RECOVERY - MariaDB read only x2 #page on db2144 is OK: Version 10.6.16-MariaDB-log, Uptime 3458s, read_only: False, event_scheduler: True, 3504.91 QPS, connection latency: 0.004117s, query latency: 0.000422s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:14:59] (03PS2) 10TTO: Enable PageNotice extension in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993824 (https://phabricator.wikimedia.org/T61245) [08:15:23] Amir1 ^ [08:15:51] (03PS3) 10TTO: Enable PageNotice extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993824 (https://phabricator.wikimedia.org/T61245) [08:16:13] fixed commit message in PS3 [08:16:52] (03CR) 10Ladsgroup: [C: 03+2] Enable PageNotice extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993824 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [08:17:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993824 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [08:17:34] (03Merged) 10jenkins-bot: Enable PageNotice extension on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993824 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [08:18:15] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:993824|Enable PageNotice extension on testwiki 
(T61245)]] [08:18:21] T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245 [08:19:42] !log ladsgroup@deploy2002 ladsgroup and tto: Backport for [[gerrit:993824|Enable PageNotice extension on testwiki (T61245)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: After switchover', diff saved to https://phabricator.wikimedia.org/P55883 and previous config saved to /var/cache/conftool/dbconfig/20240130-082045-root.json [08:20:55] tto: live in mwdebug [08:20:59] do you know how to test it? [08:21:02] Yes [08:21:04] Will test now [08:21:32] Can confirm working [08:21:34] Amir12 [08:21:36] Amir1 [08:21:45] awesome [08:21:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 25%: Switchover', diff saved to https://phabricator.wikimedia.org/P55884 and previous config saved to /var/cache/conftool/dbconfig/20240130-082149-root.json [08:22:11] How long do you recommend to leave it on testwiki only before enabling on any other wiki? [08:22:17] !log ladsgroup@deploy2002 ladsgroup and tto: Continuing with sync [08:22:18] Would 1 week be good? [08:22:42] yeah but a perf review should be done in the mean time, just to check the caching of the messages [08:22:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/hue] - 10https://gerrit.wikimedia.org/r/993708 (https://phabricator.wikimedia.org/T349400) (owner: 10Brouberol) [08:23:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P55885 and previous config saved to /var/cache/conftool/dbconfig/20240130-082324-root.json [08:23:24] And how should I go about that? 
[08:23:37] I'm a volunteer, I don't know who's who and don't really have rights over anyone's time [08:23:40] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) To keep on the radar but non-blocking : {T309724} [08:26:11] tto: I'll help you with that [08:27:12] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Midleading) Yes, the thumbnails can be regenerated. But with a page load time of 10+ seconds... This is unaccepta... [08:27:44] Amir1, thanks. I'll write this down at T61245 [08:27:45] T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245 [08:27:57] Would you like a subtask for that? Or not needed? [08:28:23] a subtask would be fine [08:28:27] Thank you! [08:28:36] Are you Amire80? [08:28:38] on pha [08:28:38] b [08:28:40] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:993824|Enable PageNotice extension on testwiki (T61245)]] (duration: 10m 24s) [08:28:44] nope, I'm Ladsgroup [08:28:50] That's the other Amir :P [08:28:57] I knew there were two! Heh [08:29:03] Of course I pick the wrong one. [08:29:06] Good thing I checked :) [08:29:13] Thanks for your help. 
[08:29:21] don't worry, we are getting mistaken at least once a week [08:29:36] !log upgrading python-pymysql on remaining DB hosts to 1.0.2-2~wmf11u1 T355531 [08:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:41] T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531 [08:29:44] (03CR) 10Brouberol: [C: 03+1] hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/993743 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [08:29:54] It could be worse, like the three Br[iy][ao]ns we once had (still have?) [08:30:35] (03CR) 10Brouberol: [C: 03+1] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/993742 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [08:30:48] (03CR) 10Brouberol: [C: 03+1] Add dummy keytabs for new an-worker1157-1175 [labs/private] - 10https://gerrit.wikimedia.org/r/993675 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [08:32:02] (03PS1) 10WMDE-Fisch: Don't bail out early when there are no selectors configured [extensions/Popups] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994028 (https://phabricator.wikimedia.org/T355933) [08:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: After switchover', diff saved to https://phabricator.wikimedia.org/P55886 and previous config saved to /var/cache/conftool/dbconfig/20240130-083550-root.json [08:36:39] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10Wikimedia-Apache-configuration, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10LSobanski) AS the original task was declined without comment, it would be helpful to understand what the input... 
[08:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 50%: Switchover', diff saved to https://phabricator.wikimedia.org/P55887 and previous config saved to /var/cache/conftool/dbconfig/20240130-083654-root.json [08:38:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2020 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P55888 and previous config saved to /var/cache/conftool/dbconfig/20240130-083829-root.json [08:46:16] (03PS1) 10Volans: scripts: remove hiera export script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/994111 [08:46:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: cloudweb: enable envoy services_proxy on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/993673 (https://phabricator.wikimedia.org/T255568) (owner: 10Majavah) [08:50:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: After switchover', diff saved to https://phabricator.wikimedia.org/P55889 and previous config saved to /var/cache/conftool/dbconfig/20240130-085055-root.json [08:52:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 75%: Switchover', diff saved to https://phabricator.wikimedia.org/P55890 and previous config saved to /var/cache/conftool/dbconfig/20240130-085159-root.json [08:52:26] The "Unable to fetch Parsoid HTML" with Reply Tool seems to have cleared now, probably because MatmaRex showed up and whatever service was bugging got scared and fixed itself. 
[08:52:49] :o [08:53:06] i just saw that bug report and was trying to find anything related in the error logs [08:57:48] !log restart swift-object-replicator on ms-be1068 [08:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:25] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host an-tool1008.eqiad.wmnet with OS bullseye [09:03:42] (03PS1) 10Ayounsi: Ganeti: readvertise netdev_master VIP [puppet] - 10https://gerrit.wikimedia.org/r/994114 (https://phabricator.wikimedia.org/T300152) [09:04:24] (03PS1) 10Slyngshede: P:debmonitor::server Correct uwsgi configuration on WMCS. [puppet] - 10https://gerrit.wikimedia.org/r/994115 [09:05:24] (03CR) 10Ayounsi: [C: 03+2] "tested locally" [puppet] - 10https://gerrit.wikimedia.org/r/994114 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:06:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1237/console" [puppet] - 10https://gerrit.wikimedia.org/r/994115 (owner: 10Slyngshede) [09:07:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 100%: Switchover', diff saved to https://phabricator.wikimedia.org/P55891 and previous config saved to /var/cache/conftool/dbconfig/20240130-090704-root.json [09:07:42] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1238/console" [puppet] - 10https://gerrit.wikimedia.org/r/994115 (owner: 10Slyngshede) [09:11:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/994115 (owner: 10Slyngshede) [09:11:29] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-tool1008.eqiad.wmnet with reason: host reimage [09:12:05] (03PS3) 10Arnaudb: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 
(https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [09:13:42] (03CR) 10CI reject: [V: 04-1] admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [09:14:17] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-tool1008.eqiad.wmnet with reason: host reimage [09:15:26] PROBLEM - BFD status on lsw1-a4-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:15:35] (03PS4) 10Arnaudb: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [09:17:11] (03CR) 10CI reject: [V: 04-1] admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [09:18:00] PROBLEM - ganeti-wconfd running on ganeti2033 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:21:06] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:debmonitor::server Correct uwsgi configuration on WMCS. 
[puppet] - 10https://gerrit.wikimedia.org/r/994115 (owner: 10Slyngshede) [09:21:18] (03CR) 10Ayounsi: [C: 03+1] scripts: remove hiera export script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/994111 (owner: 10Volans) [09:30:12] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-tool1008.eqiad.wmnet with OS bullseye [09:30:45] (03PS1) 10Ilias Sarantopoulos: admin_ng: elevate ml users expermintal permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) [09:32:13] (03PS2) 10Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) [09:33:46] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10Wikimedia-Apache-configuration, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10MoritzMuehlenhoff) >>! In T337949#9497483, @LSobanski wrote: > AS the original task was declined without comme... 
[09:35:02] (03PS3) 10Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) [09:35:54] (03CR) 10Muehlenhoff: admin: add amastilovic to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [09:38:57] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:40:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (patch and descriptions on task)" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [09:42:00] (03PS2) 10Majavah: wikimediacloud.org: Move Rabbit traffic back to all nodes [dns] - 10https://gerrit.wikimedia.org/r/993062 (https://phabricator.wikimedia.org/T345610) [09:42:54] (03CR) 10Majavah: [C: 03+2] wikimediacloud.org: Move Rabbit traffic back to all nodes [dns] - 10https://gerrit.wikimedia.org/r/993062 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [09:43:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] admin_ng: elevate ml users experimental permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [09:46:39] 10SRE, 10serviceops: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10jnuche) 05Open→03Resolved a:03jnuche After my recent changes to the staging code, failures during the presync shouldn't cause old versions to pile up anymore. Last night presy... 
[09:49:18] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10klausman) [09:50:39] (ProbeDown) firing: (6) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:14] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: cloudweb: enable envoy services_proxy on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/993673 (https://phabricator.wikimedia.org/T255568) (owner: 10Majavah) [09:52:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] mesh.configuration: Add sampling support in tracing (copy paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/993097 (https://phabricator.wikimedia.org/T351567) (owner: 10Alexandros Kosiaris) [09:52:33] (03CR) 10Alexandros Kosiaris: "Thanks! I 'll followup with the mediawiki patch to utilize this." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [09:52:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [09:53:47] (03Merged) 10jenkins-bot: mesh.configuration: Add sampling support in tracing (copy paste patch) [deployment-charts] - 10https://gerrit.wikimedia.org/r/993097 (https://phabricator.wikimedia.org/T351567) (owner: 10Alexandros Kosiaris) [09:53:49] (03Merged) 10jenkins-bot: tracing: Add local_service/support random sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/993098 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [09:54:27] (03PS1) 10Muehlenhoff: Also make debmonitor1003 a debmonitor host [puppet] - 10https://gerrit.wikimedia.org/r/994120 (https://phabricator.wikimedia.org/T241049) [09:56:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host phab1004.eqiad.wmnet [09:59:00] (03PS1) 10Muehlenhoff: Switch phab1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/994121 (https://phabricator.wikimedia.org/T349619) [10:00:15] hi, I'm gonna stop the CI Jenkins in ~10m to perform an upgrade [10:00:17] !log gmodena@deploy2002 Started deploy [airflow-dags/analytics@ccaa5dc]: (no justification provided) [10:00:54] !log gmodena@deploy2002 Finished deploy [airflow-dags/analytics@ccaa5dc]: (no justification provided) (duration: 00m 37s) [10:01:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This still brings in needlessly the mesh parts of the module. 
We probably want to depend on https://gerrit.wikimedia.org/r/c/operations/de" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:01:14] (03CR) 10Muehlenhoff: [C: 03+2] Switch phab1004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/994121 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:02:17] 10SRE, 10LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (10eoghan) @thcipriani Can you give a quick approval for @Aklapper, please? [10:02:31] 10SRE, 10LDAP-Access-Requests, 10LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (10eoghan) [10:05:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] Alert for containers with memory issues (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [10:06:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Actually merging this" [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [10:06:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] Alert for containers with memory issues [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [10:06:35] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10Wikimedia-Apache-configuration, and 3 others: Add security.txt to Wikimedia sites? (2023 edition) - https://phabricator.wikimedia.org/T337949 (10ABran-WMF) p:05Triage→03Medium [10:06:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host phab1004.eqiad.wmnet [10:07:39] (03Merged) 10jenkins-bot: Alert for containers with memory issues [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: 10JMeybohm) [10:08:00] (03CR) 10Slyngshede: "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/994120 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [10:08:04] (03PS5) 10Arnaudb: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [10:08:25] (03CR) 10Arnaudb: admin: add amastilovic to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [10:09:34] RECOVERY - Check systemd state on an-airflow1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:37] (03CR) 10CI reject: [V: 04-1] admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [10:09:42] RECOVERY - Check systemd state on an-airflow1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:44] (03PS1) 10Ayounsi: Add sretest2005 to partman/site.pp [puppet] - 10https://gerrit.wikimedia.org/r/994123 [10:13:49] (03CR) 10Btullis: [C: 03+1] "Thanks. Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/993727 (https://phabricator.wikimedia.org/T335261) (owner: 10Muehlenhoff) [10:13:53] (03CR) 10Filippo Giunchedi: [C: 03+2] jaeger: add oauth2-proxy sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:15:47] (03CR) 10Volans: [C: 03+2] scripts: remove hiera export script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/994111 (owner: 10Volans) [10:16:02] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-airflow1005.eqiad.wmnet with OS bullseye [10:16:22] (03Merged) 10jenkins-bot: scripts: remove hiera export script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/994111 (owner: 10Volans) [10:16:35] (03Abandoned) 10Brouberol: Build hue for Debian Bullseye by default [debs/hue] - 10https://gerrit.wikimedia.org/r/993708 (https://phabricator.wikimedia.org/T349400) (owner: 10Brouberol) [10:16:59] Jenkins update completed [10:17:15] (03PS1) 10Arnaudb: admin: add 4 users to group releng [puppet] - 10https://gerrit.wikimedia.org/r/993480 (https://phabricator.wikimedia.org/T356043) [10:18:40] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994123 (owner: 10Ayounsi) [10:18:46] (03CR) 10Ayounsi: [C: 03+2] Add sretest2005 to partman/site.pp [puppet] - 10https://gerrit.wikimedia.org/r/994123 (owner: 10Ayounsi) [10:23:16] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:23:33] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host srestest2005.codfw.wmnet [10:23:35] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:24:31] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add dummy keytabs for new an-worker1157-1175 [labs/private] - 10https://gerrit.wikimedia.org/r/993675 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [10:24:41] !log ayounsi@cumin1002 END (ERROR) - Cookbook 
sre.dns.netbox (exit_code=97) [10:24:58] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host srestest2005.codfw.wmnet [10:25:58] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host srestest2005.codfw.wmnet [10:25:59] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:26:39] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage [10:27:29] (03PS1) 10Alexandros Kosiaris: mediawiki: Bump mesh modules minor versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994146 (https://phabricator.wikimedia.org/T351567) [10:28:06] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [10:28:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [10:28:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:28:59] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache srestest2005.codfw.wmnet on all recursors [10:29:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) srestest2005.codfw.wmnet on all recursors [10:29:04] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:31:40] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:31:46] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:31:50] (03PS3) 10Hnowlan: kubernetes: make 5 jobrunners kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/993714 (https://phabricator.wikimedia.org/T354791) [10:31:56] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling 
restart_daemons on A:netbox-canary [10:32:04] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage [10:32:16] !log volans@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox-canary [10:33:15] (03CR) 10Muehlenhoff: admin: add amastilovic to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [10:33:28] (03CR) 10Muehlenhoff: [C: 03+2] airflow: Keep Python 2 due to Hive [puppet] - 10https://gerrit.wikimedia.org/r/993727 (https://phabricator.wikimedia.org/T335261) (owner: 10Muehlenhoff) [10:34:35] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [10:34:43] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:34:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:35:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [10:35:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:35:27] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache srestest2005.codfw.wmnet on all recursors [10:35:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) srestest2005.codfw.wmnet on all recursors [10:35:31] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host srestest2005.codfw.wmnet [10:38:15] (03CR) 10EoghanGaffney: [C: 03+1] "Approvals for the 
members are in the linked phabricator ticket, looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/993480 (https://phabricator.wikimedia.org/T356043) (owner: 10Arnaudb) [10:38:58] (03CR) 10Majavah: [C: 04-1] "The ticket is asking to be added to the 'releng' LDAP group and not the 'release-engineering' Unix group." [puppet] - 10https://gerrit.wikimedia.org/r/993480 (https://phabricator.wikimedia.org/T356043) (owner: 10Arnaudb) [10:39:24] (03CR) 10Effie Mouzeli: "I see your point and I am happy to refactor after 991369, or after we have generally untied the Gordian knot of dependencies. However, I f" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:41:36] jouncebot: next [10:41:36] In 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1100) [10:41:42] jouncebot: now [10:41:42] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [10:43:33] (03PS1) 10Muehlenhoff: Enable Puppet for Phabricator on the role level [puppet] - 10https://gerrit.wikimedia.org/r/994148 (https://phabricator.wikimedia.org/T349619) [10:45:05] (03CR) 10Muehlenhoff: [C: 04-1] "Yeah, there's no need to record the releng LDAP membership for users who are already tracked in data.yaml to have signed an NDA."
[puppet] - 10https://gerrit.wikimedia.org/r/993480 (https://phabricator.wikimedia.org/T356043) (owner: 10Arnaudb) [10:45:40] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:46:10] (03Abandoned) 10Arnaudb: admin: add 4 users to group releng [puppet] - 10https://gerrit.wikimedia.org/r/993480 (https://phabricator.wikimedia.org/T356043) (owner: 10Arnaudb) [10:46:49] (03CR) 10FNegri: [C: 03+1] "This seems fine to me, but it also goes way beyond my Exim knowledge, so I'm adding a +1 but I would recommend a look by someone with more" [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [10:49:58] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) I understand, we won't delete everything all at once. It'll be at least a slow rolling deletion. I wan... 
[10:50:27] (03CR) 10Muehlenhoff: [C: 03+2] Also make debmonitor1003 a debmonitor host [puppet] - 10https://gerrit.wikimedia.org/r/994120 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [10:54:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Peachey88) [10:54:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM too" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [10:56:07] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:56:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-airflow1005.eqiad.wmnet with OS bullseye [10:59:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:59:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:59:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T343718)', diff saved to https://phabricator.wikimedia.org/P55892 and previous config saved to /var/cache/conftool/dbconfig/20240130-105954-ladsgroup.json [11:00:00] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1100) [11:00:05] (03PS2) 10Gmodena: eventstreams: service version bump. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) [11:02:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T343718)', diff saved to https://phabricator.wikimedia.org/P55893 and previous config saved to /var/cache/conftool/dbconfig/20240130-110207-ladsgroup.json [11:02:54] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::airflow::search [11:04:18] (03PS1) 10Muehlenhoff: Switch airflow/search to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/994153 (https://phabricator.wikimedia.org/T349619) [11:06:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [11:06:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [11:06:47] (03CR) 10Btullis: [C: 03+2] Update the spark-operator image name and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273) (owner: 10Btullis) [11:07:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch airflow/search to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/994153 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:07:31] (03CR) 10David Caro: P:toolforge: mailrelay: workaround Exim 4.94 taints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [11:07:34] (03Abandoned) 10Btullis: Use insetup::buster for the old namenodes [puppet] - 10https://gerrit.wikimedia.org/r/990665 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:08:28] (03Abandoned) 10Btullis: Add data for the new an-master100[3-4] [puppet] - 10https://gerrit.wikimedia.org/r/989901 (https://phabricator.wikimedia.org/T332573) (owner: 10Btullis) [11:09:31] (03PS6) 10Btullis: Update default airflow_version and 
remove overrides [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) [11:09:52] (03Merged) 10jenkins-bot: Update the spark-operator image name and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273) (owner: 10Btullis) [11:10:06] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [11:10:49] (03CR) 10Btullis: [C: 03+1] hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/993743 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [11:11:02] PROBLEM - Check systemd state on debmonitor1003 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:10] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Midleading) Keep your words, don't delete everything at once. I can easily imagine a situation when I would re-vi... [11:11:27] (03CR) 10Btullis: [C: 03+1] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/993742 (https://phabricator.wikimedia.org/T353776) (owner: 10Stevemunene) [11:11:53] (03CR) 10Btullis: [C: 03+1] hadoop:httpd: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/993075 (owner: 10Muehlenhoff) [11:12:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::airflow::search [11:13:10] (03CR) 10Btullis: [C: 03+1] "Great, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/993065 (https://phabricator.wikimedia.org/T349936) (owner: 10Muehlenhoff) [11:13:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10MatthewVernon) [11:14:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Data-Persistence, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10MatthewVernon) [I'll want to check afterwards that the ms-be nodes are happy, but this shouldn't be an issue] [11:16:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10MatthewVernon) [11:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P55894 and previous config saved to /var/cache/conftool/dbconfig/20240130-111713-ladsgroup.json [11:17:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:18:20] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/Debmonitor [11:19:06] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10MatthewVernon) swift will need depooling in codfw before this work. Likewise the affected thanos-fe node. I... 
[11:19:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10MatthewVernon) [11:19:37] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1157.eqiad.wmnet [11:20:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:20:39] (ProbeDown) firing: (10) Service debmonitor1002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:46] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1003 is CRITICAL: connect to address 10.64.32.12 and port 7443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [11:21:08] (03CR) 10Majavah: [V: 03+1] P:toolforge: mailrelay: workaround Exim 4.94 taints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [11:21:20] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10MatthewVernon) Once complete, I'll want to check the ms-be nodes are all happy (shouldn't be an issue). 
[11:21:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (10MatthewVernon) [11:21:58] slyngs: FYI ^^^ (debmonitor alerts) [11:22:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [11:22:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [11:22:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (10MatthewVernon) Once complete I'll want to check the backends, but this shouldn't be an issue. [11:23:10] PROBLEM - debmonitor.wikimedia.org:7443 CDN SSL Expiry on debmonitor1003 is CRITICAL: connect to address 10.64.32.12 and port 7443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [11:23:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 (10MatthewVernon) [11:25:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 (10MatthewVernon) The affected thanos frontend will need depooling. Similarly, swift in codfw will need depooling. 
[11:26:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 (10MatthewVernon) [11:27:28] (03CR) 10David Caro: P:toolforge: mailrelay: workaround Exim 4.94 taints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [11:27:35] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 (10MatthewVernon) I'll want to check the backends once this work is complete, but it shouldn't be an issue. [11:28:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-airflow1005.eqiad.wmnet [11:29:04] (03CR) 10David Caro: [C: 03+1] "LGTM, not an expert though" [puppet] - 10https://gerrit.wikimedia.org/r/993697 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [11:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:30:38] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1157.eqiad.wmnet [11:31:08] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), 10Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) Thumbs that are being used get cached in the CDN in any case. 
[11:32:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P55895 and previous config saved to /var/cache/conftool/dbconfig/20240130-113220-ladsgroup.json [11:32:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1005.eqiad.wmnet [11:42:09] (03PS1) 10Majavah: aptrepo: add component/exim4-arc [puppet] - 10https://gerrit.wikimedia.org/r/994157 (https://phabricator.wikimedia.org/T356171) [11:42:49] (03PS4) 10Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) [11:44:46] (03CR) 10Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [11:47:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T343718)', diff saved to https://phabricator.wikimedia.org/P55896 and previous config saved to /var/cache/conftool/dbconfig/20240130-114726-ladsgroup.json [11:47:33] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:48:09] (03CR) 10Hashar: [C: 03+1] [gerrit] Add rsync job for lfs sync [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [12:01:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994157 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [12:01:53] (03CR) 10Majavah: [C: 03+2] aptrepo: add component/exim4-arc [puppet] - 10https://gerrit.wikimedia.org/r/994157 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [12:12:09] (03CR) 10Slyngshede: P:debmonitor::server rework debmonitor http monitoring. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:18:39] (03PS2) 10Majavah: P:toolforge::mailrelay: add Authentication-Results header [puppet] - 10https://gerrit.wikimedia.org/r/993697 (https://phabricator.wikimedia.org/T354112) [12:19:18] !log reprepro import exim4 4.96-15+deb12u4+wmf1 to component/exim4-arc T356171 [12:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:24] T356171: Enable ARC support in Toolforge - https://phabricator.wikimedia.org/T356171 [12:21:05] (03PS1) 10Slyngshede: P:debmonitor::server link uwsgi app ini [puppet] - 10https://gerrit.wikimedia.org/r/994161 [12:22:37] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1239/co" [puppet] - 10https://gerrit.wikimedia.org/r/994161 (owner: 10Slyngshede) [12:23:02] (03PS2) 10Slyngshede: P:debmonitor::server link uwsgi app ini [puppet] - 10https://gerrit.wikimedia.org/r/994161 [12:24:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1240/co" [puppet] - 10https://gerrit.wikimedia.org/r/994161 (owner: 10Slyngshede) [12:26:08] (03CR) 10Majavah: [C: 03+2] P:toolforge::mailrelay: add Authentication-Results header [puppet] - 10https://gerrit.wikimedia.org/r/993697 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [12:27:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1241/console" [puppet] - 10https://gerrit.wikimedia.org/r/994161 (owner: 10Slyngshede) [12:36:45] (03PS3) 10Gmodena: eventstreams: service version bump. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) [12:38:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/994161 (owner: 10Slyngshede) [12:38:18] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:debmonitor::server link uwsgi app ini [puppet] - 10https://gerrit.wikimedia.org/r/994161 (owner: 10Slyngshede) [12:40:52] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 680 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [12:42:59] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: make 5 jobrunners kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/993714 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:44:22] (ProbeDown) firing: (8) Service debmonitor1003:443 has failed probes (http_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:45:58] 10SRE, 10ops-eqiad: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T354499 (10hnowlan) Failing disk: ` root@aqs1013:/home/hnowlan# udevadm info --query=all --name=/dev/sde| grep SERIAL E: ID_SERIAL=MZ7KH1T9HAJR0D3_S4KVNA0MB04213 E: ID_SERIAL_SHORT=S4KVNA0MB04213 root@aqs1013:/home/hnowlan# dmesg... 
[12:46:48] (03PS1) 10Majavah: Remove Brooke's root key [labs/private] - 10https://gerrit.wikimedia.org/r/994162 [12:48:47] (03PS1) 10Majavah: Add fake ARC signing keys [labs/private] - 10https://gerrit.wikimedia.org/r/994163 (https://phabricator.wikimedia.org/T354112) [12:49:18] (03CR) 10Majavah: [V: 03+2 C: 03+2] Add fake ARC signing keys [labs/private] - 10https://gerrit.wikimedia.org/r/994163 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [12:52:57] (03PS1) 10Slyngshede: P:firewall absent conntrack_table_size monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/994164 (https://phabricator.wikimedia.org/T350694) [12:55:01] (03PS1) 10Ayounsi: ip_v4.address is actually a str but we expect an ipaddress object [cookbooks] - 10https://gerrit.wikimedia.org/r/994165 [12:55:18] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [12:55:31] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:56:24] (03PS3) 10Majavah: P:toolforge: mailrelay: workaround Exim 4.94 taints [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) [12:56:26] (03PS1) 10Majavah: exim4: Allow installing from component [puppet] - 10https://gerrit.wikimedia.org/r/994166 [12:56:29] (03PS1) 10Majavah: P:toolforge::mailrelay: ARC sign outbound mail [puppet] - 10https://gerrit.wikimedia.org/r/994167 [12:57:27] (03CR) 10Muehlenhoff: [C: 03+1] Remove Brooke's root key [labs/private] - 10https://gerrit.wikimedia.org/r/994162 (owner: 10Majavah) [12:57:28] (03PS1) 10Btullis: Enable the presto_nested_data feature on superset-next [puppet] - 10https://gerrit.wikimedia.org/r/994168 (https://phabricator.wikimedia.org/T340144) [12:57:42] (03PS6) 10Arnaudb: admin: add amastilovic to analytics-privatedata [puppet] - 
10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [12:58:01] (03CR) 10Arnaudb: admin: add amastilovic to analytics-privatedata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [12:58:17] (03PS1) 10Slyngshede: P:url_downloader absent Icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/994170 (https://phabricator.wikimedia.org/T350694) [12:58:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/994168 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1300) [13:02:46] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/994165 (owner: 10Ayounsi) [13:02:59] (03CR) 10Brouberol: [C: 03+1] Enable the presto_nested_data feature on superset-next [puppet] - 10https://gerrit.wikimedia.org/r/994168 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis) [13:03:01] (03CR) 10Ayounsi: [C: 03+2] ip_v4.address is actually a str but we expect an ipaddress object [cookbooks] - 10https://gerrit.wikimedia.org/r/994165 (owner: 10Ayounsi) [13:03:10] (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable the presto_nested_data feature on superset-next [puppet] - 10https://gerrit.wikimedia.org/r/994168 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis) [13:04:17] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1158.eqiad.wmnet [13:05:20] (03CR) 10Majavah: [V: 03+2 C: 03+2] Remove Brooke's root key [labs/private] - 10https://gerrit.wikimedia.org/r/994162 (owner: 10Majavah) [13:05:32] (03CR) 10Muehlenhoff: "Looks good (once the key has been confirmed out-of-band). 
You can also go ahead and create the Kerberos principal already:" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [13:05:35] (03CR) 10Muehlenhoff: [C: 03+1] admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [13:06:59] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1158.eqiad.wmnet [13:08:34] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host srestest2005.codfw.wmnet [13:08:35] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:08:45] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1159-1175].eqiad.wmnet [13:08:56] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker[1159-1175].eqiad.wmnet [13:10:24] (03PS1) 10Slyngshede: P:systemd::timesyncd absent monitoring, handled by AlertManager [puppet] - 10https://gerrit.wikimedia.org/r/994172 (https://phabricator.wikimedia.org/T350694) [13:10:59] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:11:55] (03CR) 10Klausman: admin_ng: elevate ml users experimental permissions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [13:12:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:12:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:26] !log ayounsi@cumin1002 START - Cookbook 
sre.dns.wipe-cache srestest2005.codfw.wmnet on all recursors [13:12:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) srestest2005.codfw.wmnet on all recursors [13:12:52] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:15:16] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:15:22] (03PS1) 10Ayounsi: Add routed ganeti cluster to Netbox sync jobs [puppet] - 10https://gerrit.wikimedia.org/r/994173 (https://phabricator.wikimedia.org/T300152) [13:16:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:16:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:08] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache srestest2005.codfw.wmnet on all recursors [13:16:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) srestest2005.codfw.wmnet on all recursors [13:16:12] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host srestest2005.codfw.wmnet [13:19:52] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994173 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:20:25] (03CR) 10Ayounsi: [C: 03+2] Add routed ganeti cluster to Netbox sync jobs [puppet] - 10https://gerrit.wikimedia.org/r/994173 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:23:55] (03PS2) 10Majavah: P:openstack: neutron: use cloud-private for memcached access [puppet] - 10https://gerrit.wikimedia.org/r/991773 (https://phabricator.wikimedia.org/T355417) [13:26:42] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host srestest2005.codfw.wmnet [13:26:43] 
!log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:31:20] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:32:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:32:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:12] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache srestest2005.codfw.wmnet on all recursors [13:32:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) srestest2005.codfw.wmnet on all recursors [13:32:41] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:32:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/994167 (owner: 10Majavah) [13:33:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM srestest2005.codfw.wmnet - ayounsi@cumin1002" [13:33:39] (03PS2) 10Majavah: exim4: Allow installing from component [puppet] - 10https://gerrit.wikimedia.org/r/994166 [13:33:41] (03PS2) 10Majavah: P:toolforge::mailrelay: ARC sign outbound mail [puppet] - 10https://gerrit.wikimedia.org/r/994167 [13:33:49] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Count_Count) The banner says 8:30 to 8:40 UTC. I was confused that it was still up. 
[13:34:51] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=94) for new host srestest2005.codfw.wmnet [13:36:05] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts srestest2005.codfw.wmnet [13:37:36] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1157-1175].eqiad.wmnet [13:38:41] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T356138 (10Papaul) 05Open→03Resolved a:03Papaul [13:39:57] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:44:24] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: srestest2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:45:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: srestest2005.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:45:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:19] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts srestest2005.codfw.wmnet [13:45:26] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `srestest2005.codfw.wmnet` - srestest2005.codfw.wmnet (**WARN**) - //Host... 
[13:47:38] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host sretest2005.codfw.wmnet [13:47:39] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:48:14] (03PS1) 10Bartosz Dziewoński: CommentParser: Ignore generated timestamp links [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994139 (https://phabricator.wikimedia.org/T356142) [13:48:40] (03PS1) 10Bartosz Dziewoński: CommentParser: Ignore generated timestamp links [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994140 (https://phabricator.wikimedia.org/T356142) [13:49:00] (03PS1) 10Bartosz Dziewoński: Add maintenance script to list users with invalid signatures [core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994141 (https://phabricator.wikimedia.org/T356168) [13:53:40] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM sretest2005.codfw.wmnet - ayounsi@cumin1002" [13:54:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM sretest2005.codfw.wmnet - ayounsi@cumin1002" [13:54:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:33] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2005.codfw.wmnet on all recursors [13:54:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2005.codfw.wmnet on all recursors [13:55:01] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM sretest2005.codfw.wmnet - ayounsi@cumin1002" [13:55:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
sretest2005.codfw.wmnet - ayounsi@cumin1002" [13:56:37] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [13:58:18] (CR) Arnaudb: "thanks for the link :-) it's been done as well, we just need the ssh-key confirmation" [puppet] - https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: AOkoth) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1400). [14:00:05] WMDE-Fisch, MatmaRex, and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] Hi :) [14:00:39] hi [14:01:28] \o [14:03:28] (PS4) Gmodena: eventstreams: service version bump. [deployment-charts] - https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) [14:03:40] seven changes, four of them backports… [14:03:52] * Lucas_WMDE gazes into a crystal ball and predicts not all of them will be deployed in an hour [14:03:54] but I can deploy :) [14:04:03] Lucas_WMDE: if you do the backports in batches, it is doable [14:04:07] i would do it, but in a meeting [14:04:20] Mine can probably be deployed together (at least two) [14:04:31] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Popups] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994028 (https://phabricator.wikimedia.org/T355933) (owner: WMDE-Fisch) [14:04:38] let’s start with WMDE-Fisch [14:04:45] \o/ [14:05:13] Also in a meeting but can test [14:06:46] you can give the other backports CR+2 now, so that they start running the CI jobs and merging [14:06:53] MatmaRex: is the new maintenance script testable on 
mwdebug? [14:06:59] they're all on different repos and branches actually, which is convenient [14:07:01] because if not, I might as well roll that out together with the other fix on that wmf branch [14:07:08] and also scap backport supports multiple change IDs :) [14:07:09] ah, it’s not the same repo anyway [14:07:24] but I think it could still be scap backported with the other change in that wmf branch [14:07:31] Lucas_WMDE: not really testable on mwdebug [14:07:38] yeah, I thought so [14:07:47] i mean, you could run it on mwdebug i guess, if you really wanted? i'm not sure if that's possible/allowed in production :P [14:08:04] does scap backport support deploying changes from different wmf branches at the same time? [14:08:08] i tried it out on the beta cluster a few moments ago though [14:08:08] yes [14:08:30] ok, then I guess there’s no strong reason to separate the wmf.15 and 16 backports of the same change either [14:08:53] yup [14:08:56] (CR) Lucas Werkmeister (WMDE): [C: +2] "starting gate-and-submit already" [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994139 (https://phabricator.wikimedia.org/T356142) (owner: Bartosz Dziewoński) [14:09:00] (CR) Lucas Werkmeister (WMDE): [C: +2] "starting gate-and-submit already" [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/994140 (https://phabricator.wikimedia.org/T356142) (owner: Bartosz Dziewoński) [14:09:03] (CR) Lucas Werkmeister (WMDE): [C: +2] "starting gate-and-submit already" [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994141 (https://phabricator.wikimedia.org/T356168) (owner: Bartosz Dziewoński) [14:09:14] * Lucas_WMDE is sometimes still in a “manual `scap sync-file`” mindset :) [14:09:18] (CR) TChin: [C: +1] eventstreams: service version bump. 
[deployment-charts] - https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) (owner: Gmodena) [14:09:24] !log volans@cumin2002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:09:25] (Merged) jenkins-bot: Don't bail out early when there are no selectors configured [extensions/Popups] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994028 (https://phabricator.wikimedia.org/T355933) (owner: WMDE-Fisch) [14:09:51] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:994028|Don't bail out early when there are no selectors configured (T355933)]] [14:09:56] T355933: Reference Previews not showing when users disable page previews in settings - https://phabricator.wikimedia.org/T355933 [14:10:05] * Lucas_WMDE looks at Superpes’ changes [14:10:29] I'm rebasing them so you can deploy them together [14:10:31] (CR) Gmodena: [C: +2] eventstreams: service version bump. [deployment-charts] - https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) (owner: Gmodena) [14:11:15] (PS2) Superpes15: [azwiki] Changing 9 namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/993452 (https://phabricator.wikimedia.org/T355041) [14:11:16] !log volans@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:11:22] (Merged) jenkins-bot: eventstreams: service version bump. 
[deployment-charts] - https://gerrit.wikimedia.org/r/992814 (https://phabricator.wikimedia.org/T354456) (owner: Gmodena) [14:11:27] !log lucaswerkmeister-wmde@deploy2002 wmde-fisch and lucaswerkmeister-wmde: Backport for [[gerrit:994028|Don't bail out early when there are no selectors configured (T355933)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:40] (CR) Lucas Werkmeister (WMDE): [C: +1] [azwiki] Changing 9 namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/993452 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:12:20] WMDE-Fisch: can you test the change? [14:12:28] (PS2) Superpes15: [enwiktionary] Remove the Concordance namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993457 (https://phabricator.wikimedia.org/T354813) [14:12:29] Lucas_WMDE: Just tested it. Works on mwdebug. [14:12:32] !log lucaswerkmeister-wmde@deploy2002 wmde-fisch and lucaswerkmeister-wmde: Continuing with sync [14:12:35] ok, thanks! [14:12:51] (PS2) Superpes15: [enwikiquote] Add a draft namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) [14:13:04] * Lucas_WMDE tries to find out if there are special considerations for deleting a namespace [14:13:21] (even if it’s empty, what happens to e.g. the deletion log entries when it’s deleted?) [14:13:22] Lucas_WMDE Me too... 
they already deleted another one lol [14:14:25] Lucas_WMDE: displays as special:Badtitle AFAIK [14:14:49] apparently T298315 didn’t go so well [14:14:51] T298315: Deleting Ns:104 in it:voy - https://phabricator.wikimedia.org/T298315 [14:15:12] (Merged) jenkins-bot: CommentParser: Ignore generated timestamp links [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994139 (https://phabricator.wikimedia.org/T356142) (owner: Bartosz Dziewoński) [14:15:18] (Merged) jenkins-bot: CommentParser: Ignore generated timestamp links [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/994140 (https://phabricator.wikimedia.org/T356142) (owner: Bartosz Dziewoński) [14:15:37] Oh my lol [14:16:40] SRE, SRE-tools, Infrastructure-Foundations, Patch-For-Review: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (Volans) p:Triage→Medium @cmooney did you have a chance to test the above failure scenario? 
AFAICT is still happening [14:16:49] T285766 looks more promising [14:16:49] T285766: Remove the Book namespace from enwiki - https://phabricator.wikimedia.org/T285766 [14:16:50] T344816 This one doesn't cause any trouble (apparently) [14:16:51] T344816: Delete the Index namespace at English Wiktionary - https://phabricator.wikimedia.org/T344816 [14:17:03] (On same projects) [14:17:54] okay, as long as there are no remaining pages in there (there aren’t), it looks like it should be okay [14:18:55] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:994028|Don't bail out early when there are no selectors configured (T355933)]] (duration: 09m 04s) [14:19:01] T355933: Reference Previews not showing when users disable page previews in settings - https://phabricator.wikimedia.org/T355933 [14:19:41] (PS1) Cmelo: Add WikimediaCampaignEvents to extension list [mediawiki-config] - https://gerrit.wikimedia.org/r/994176 (https://phabricator.wikimedia.org/T347894) [14:20:36] !log lucaswerkmeister-wmde@deploy2002 backport Cancelled [14:20:37] Lucas_WMDE: Works like a charm. Thanks! /me Done. [14:20:43] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994141 (https://phabricator.wikimedia.org/T356168) (owner: Bartosz Dziewoński) [14:20:49] WMDE-Fisch: \o/ [14:20:55] MatmaRex: backporting your changes now [14:21:06] thanks [14:21:19] (CR) Lucas Werkmeister (WMDE): [C: +1] "Should be okay as far as I can tell." [mediawiki-config] - https://gerrit.wikimedia.org/r/993457 (https://phabricator.wikimedia.org/T354813) (owner: Superpes15) [14:22:05] SRE, Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (xcollazo) All right, that one worked as expected. Let's make the change permanent. Thanks @Dzahn ! 
[14:25:31] (PS1) Jgiannelos: mobileapps: Switchover PCS to core page HTML [deployment-charts] - https://gerrit.wikimedia.org/r/994177 (https://phabricator.wikimedia.org/T339865) [14:26:05] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:26:32] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:29:27] (CR) Clément Goubert: [C: +1] mediawiki: Bump mesh modules minor versions [deployment-charts] - https://gerrit.wikimedia.org/r/994146 (https://phabricator.wikimedia.org/T351567) (owner: Alexandros Kosiaris) [14:29:41] (CR) Lucas Werkmeister (WMDE): [enwikiquote] Add a draft namespace and its talk space (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) (owner: Superpes15) [14:30:07] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:30:09] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:30:27] (Merged) jenkins-bot: Add maintenance script to list users with invalid signatures [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994141 (https://phabricator.wikimedia.org/T356168) (owner: Bartosz Dziewoński) [14:30:54] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:994139|CommentParser: Ignore generated timestamp links (T356142)]], [[gerrit:994140|CommentParser: Ignore generated timestamp links (T356142)]], [[gerrit:994141|Add maintenance script to list users with invalid signatures (T356168)]] [14:30:58] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:31:00] T356142: Comment-not-found notification is shown also if the comment is not deleted - https://phabricator.wikimedia.org/T356142 [14:31:00] T356168: Add a way to generate a list of users with an invalid signatures - https://phabricator.wikimedia.org/T356168 
[14:31:44] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:32:13] (PS3) Superpes15: [enwikiquote] Add a draft namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) [14:32:26] !log lucaswerkmeister-wmde@deploy2002 matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:994139|CommentParser: Ignore generated timestamp links (T356142)]], [[gerrit:994140|CommentParser: Ignore generated timestamp links (T356142)]], [[gerrit:994141|Add maintenance script to list users with invalid signatures (T356168)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:32:28] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:32:45] MatmaRex: can you test the CommentParser part on mwdebug? [14:33:22] yup, looking [14:33:23] and I’d leave running the maintenance script until the end of the window, if that’s okay [14:34:23] (CR) Superpes15: [enwikiquote] Add a draft namespace and its talk space (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) (owner: Superpes15) [14:35:10] (CR) Lucas Werkmeister (WMDE): [C: +1] "Alright, should be good to go I think :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) (owner: Superpes15) [14:35:25] Lucas_WMDE: looks good, tested on https://www.mediawiki.org/wiki/User_talk:Matma_Rex [14:35:33] alright, thanks! 
[14:35:35] !log lucaswerkmeister-wmde@deploy2002 matmarex and lucaswerkmeister-wmde: Continuing with sync [14:36:34] (PS1) Mhorsey: Update commonsettings-labs to enable WikimediaCampaignEvents extension on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/994179 (https://phabricator.wikimedia.org/T347894) [14:38:02] SRE: operations/deployment-charts can't be checked out on Windows due to a file named aux.yaml - https://phabricator.wikimedia.org/T356185 (matmarex) [14:38:28] just thinking out loud… I wonder if `scap backport` should print a reminder to run namespaceDupes after syncing a change that touched `core-Namespaces.php` [14:38:39] (which is possible now that it’s not all mushed together in IS.php anymore, yay) [14:38:54] because I think I might have forgotten it a few days ago, and just remembered I should do it for the config changes today [14:38:58] SRE: operations/deployment-charts can't be checked out on Windows due to a file named aux.yaml - https://phabricator.wikimedia.org/T356185 (matmarex) (the file was added in 80e5bb3282d311af8608fc9c250828e2bf330df3) [14:39:25] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:58] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:41:28] (PS1) Filippo Giunchedi: oauth2-proxy: run as nobody or explicit uid [docker-images/production-images] - https://gerrit.wikimedia.org/r/994182 (https://phabricator.wikimedia.org/T320555) [14:41:39] MatmaRex: re T356185 – apparently https://en.wikipedia.org/wiki/ISO_639:aux exists… imagine if one day we had i18n/aux.json files in tons of extensions ._. 
[14:41:40] T356185: operations/deployment-charts can't be checked out on Windows due to a file named aux.yaml - https://phabricator.wikimedia.org/T356185 [14:41:41] (CR) Jgiannelos: "Removes staging specific config entry and make it global to PCS" [deployment-charts] - https://gerrit.wikimedia.org/r/994177 (https://phabricator.wikimedia.org/T339865) (owner: Jgiannelos) [14:41:48] though it’s apparently extinct [14:41:55] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:994139|CommentParser: Ignore generated timestamp links (T356142)]], [[gerrit:994140|CommentParser: Ignore generated timestamp links (T356142)]], [[gerrit:994141|Add maintenance script to list users with invalid signatures (T356168)]] (duration: 11m 01s) [14:42:03] T356142: Comment-not-found notification is shown also if the comment is not deleted - https://phabricator.wikimedia.org/T356142 [14:42:03] T356168: Add a way to generate a list of users with an invalid signatures - https://phabricator.wikimedia.org/T356168 [14:42:04] but there appears to be an alive NUL language, IIRC that’s also verboten on windows https://en.wikipedia.org/wiki/Nusa_Laut_language [14:42:05] Lucas_WMDE: hah [14:42:31] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/993452 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:43:20] (Merged) jenkins-bot: [azwiki] Changing 9 namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/993452 (https://phabricator.wikimedia.org/T355041) (owner: Superpes15) [14:43:44] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:993452|[azwiki] Changing 9 namespace aliases (T355041)]] [14:43:49] T355041: Creation of namespace abbreviations in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T355041 [14:45:09] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: 
Backport for [[gerrit:993452|[azwiki] Changing 9 namespace aliases (T355041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:16] Superpes: please test azwiki [14:46:02] change at https://az.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&formatversion=2 looks good to me, at least [14:46:04] Lucas_WMDE It works! [14:46:07] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync [14:47:37] (PS1) Clément Goubert: kubernetes: make 3 api_appservers kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/994183 (https://phabricator.wikimedia.org/T351074) [14:50:49] 1 pages to fix, 1 were resolvable. ; and 477 links to fix, 476 were resolvable, 0 were deleted. [14:50:59] I’ll do the non-dry-run once scap is done [14:51:15] (CR) EoghanGaffney: [V: +1 C: +2] [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney) [14:51:51] (CR) Lucas Werkmeister (WMDE): [C: +2] "starting gate-and-submit already" [mediawiki-config] - https://gerrit.wikimedia.org/r/993457 (https://phabricator.wikimedia.org/T354813) (owner: Superpes15) [14:52:21] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:993452|[azwiki] Changing 9 namespace aliases (T355041)]] (duration: 08m 37s) [14:52:26] T355041: Creation of namespace abbreviations in Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T355041 [14:52:35] (Merged) jenkins-bot: [enwiktionary] Remove the Concordance namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993457 (https://phabricator.wikimedia.org/T354813) (owner: Superpes15) [14:52:39] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes azwiki --fix # T355041, failed at the end :( [14:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:52:47] Oh [14:53:04] presumably the fix for T355628 isn’t on wmf.15 yet [14:53:05] T355628: namespaceDupes.php crashed when trying to fix templatelinks - https://phabricator.wikimedia.org/T355628 [14:53:09] oops, wrong task [14:53:13] T341993 [14:53:14] T341993: namespaceDupes.php can fail if new target does not have a linktarget entry - https://phabricator.wikimedia.org/T341993 [14:53:18] yeah, that’s wmf.16 only [14:53:18] meh [14:53:53] Ah :// [14:54:58] not a problem, I think [14:55:04] I left a comment on the phab task [14:55:32] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:993457|[enwiktionary] Remove the Concordance namespace and its talk space (T354813)]] [14:55:38] T354813: Delete the Concordance namespace at English Wiktionary - https://phabricator.wikimedia.org/T354813 [14:55:51] Yep, if they failed at the end, there is probably very little to fix [14:56:10] Ah 3 links! Wonderful :D [14:56:44] (PS1) Volans: postgres backups: add hard link for latest [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) [14:57:02] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Backport for [[gerrit:993457|[enwiktionary] Remove the Concordance namespace and its talk space (T354813)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:57:12] alright, please test enwiktionary next :) [14:57:23] (PS5) Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions [deployment-charts] - https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) [14:57:58] SRE, ops-eqiad, DBA, DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (Marostegui) Thanks - I can reach the host. I will take it from here. Thank you! 
[14:58:05] (CR) Ilias Sarantopoulos: admin_ng: elevate ml users experimental permissions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: Ilias Sarantopoulos) [14:58:20] (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/994146 (https://phabricator.wikimedia.org/T351567) (owner: Alexandros Kosiaris) [14:58:24] https://en.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&formatversion=2 difference looks good to me [14:59:04] Yep it works, sorry my interned crashed for 2 minutes, nss are removed Lucas_WMDE [14:59:04] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: Volans) [14:59:10] !log lucaswerkmeister-wmde@deploy2002 superpes and lucaswerkmeister-wmde: Continuing with sync [14:59:10] *Internet [14:59:25] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:28] (CR) Volans: "Tested on a netbox DB host" [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: Volans) [14:59:42] (Merged) jenkins-bot: mediawiki: Bump mesh modules minor versions [deployment-charts] - https://gerrit.wikimedia.org/r/994146 (https://phabricator.wikimedia.org/T351567) (owner: Alexandros Kosiaris) [15:00:17] (PS2) Volans: postgres backups: add hard link for latest [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) [15:00:26] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: Volans) [15:00:36] 
urbanecm was right, https://en.wiktionary.org/w/index.php?title=Special:Log&logid=52605113 shows Special:Badtitle on mwdebug [15:00:52] including the namespace number, isn’t that nice [15:01:19] (Abandoned) Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/992660 (owner: PipelineBot) [15:01:21] (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/992661 (owner: PipelineBot) [15:01:26] jouncebot: now [15:01:26] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [15:01:35] Superpes: if you have a bit more time, I think we can still do the enwikiquote change [15:01:44] Yep yep no problem for me :) [15:01:48] ok :) [15:01:55] (CR) Lucas Werkmeister (WMDE): [C: +2] [enwikiquote] Add a draft namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) (owner: Superpes15) [15:02:00] I'm sorry if I keep you longer than expected [15:02:38] (Merged) jenkins-bot: [enwikiquote] Add a draft namespace and its talk space [mediawiki-config] - https://gerrit.wikimedia.org/r/993458 (https://phabricator.wikimedia.org/T355195) (owner: Superpes15) [15:02:42] (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/992661 (owner: PipelineBot) [15:03:23] enwiktionary apparently has one link to fix, huh [15:03:34] (not related to this change, afaict) [15:03:45] I’ll do the --fix after the php-fpm-restart is done [15:05:20] apparently https://en.wiktionary.org/w/index.php?curid=657941 is testing… transcluding the wikipedia user page? 
huh [15:05:29] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:993457|[enwiktionary] Remove the Concordance namespace and its talk space (T354813)]] (duration: 09m 57s) [15:05:30] that might be the reason for “templatelinks from=657941 ns=0 dbk=User:Eirikr -> User:Eirikr DRY RUN” [15:05:31] idk [15:05:36] T354813: Delete the Concordance namespace at English Wiktionary - https://phabricator.wikimedia.org/T354813 [15:06:25] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes enwiktionary --fix # T354813 [15:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:41] !log Manual run of mediawiki_job_generatecaptcha.service following timer failure - T141490 [15:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:46] T141490: Deploy improved FancyCaptcha - https://phabricator.wikimedia.org/T141490 [15:07:18] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:993458|[enwikiquote] Add a draft namespace and its talk space (T355195)]] [15:07:21] (CR) BCornwall: "Hi, sukhe/Volans. Since you two had a meeting about this, what do you two prefer this patch look like?" [cookbooks] - https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: BCornwall) [15:07:23] T355195: Create a Draft namespace on English Wikiquote - https://phabricator.wikimedia.org/T355195 [15:08:10] (CR) Hnowlan: [C: +1] kubernetes: make 3 api_appservers kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/994183 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [15:08:32] Ah yep it could be [15:08:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and superpes: Backport for [[gerrit:993458|[enwikiquote] Add a draft namespace and its talk space (T355195)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:09:26] Ok it works! 
[15:09:32] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and superpes: Continuing with sync [15:12:26] (CR) Jforrester: [C: +1] "https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/57 landed three weeks ago so this is now safe to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/994176 (https://phabricator.wikimedia.org/T347894) (owner: Cmelo) [15:15:28] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:52] ok, looks like two pages will need a prefix [15:16:02] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:993458|[enwikiquote] Add a draft namespace and its talk space (T355195)]] (duration: 08m 43s) [15:16:07] T355195: Create a Draft namespace on English Wikiquote - https://phabricator.wikimedia.org/T355195 [15:16:12] although I don’t think that’s due to the draft namespace, actually? [15:16:21] Wq:SheSaid and Wq:shesaid [15:16:27] maybe a Wq namespace alias was introduced? 
[15:16:57] Uhm looking [15:17:04] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [15:17:04] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host sretest2005.codfw.wmnet [15:17:13] Surely draft is not involved [15:17:14] !log Recommissioning mw2366.codfw.wmnet,mw2368.codfw.wmnet,mw2370.codfw.wmnet as k8s nodes - T351074 [15:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:19] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [15:17:41] !log lucaswerkmeister-wmde@mwmaint2002:~$ mwscript namespaceDupes enwikiquote --fix # T355195 (two pages will need separate fixing) [15:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:52] Uhm enwikiquote should not have any aliases [15:18:30] I think it’s from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/734383 [15:18:38] if the namespace is case-insensitive [15:18:50] yeah, seems to be [15:19:02] then it’s understandable that namespaceDupes wasn’t run on all affected wikis at the time ^^ [15:19:21] (CR) Hnowlan: [C: +2] kubernetes: make 5 jobrunners kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/993714 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [15:19:32] (CR) Clément Goubert: [C: +2] kubernetes: make 3 api_appservers kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/994183 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [15:21:15] (CR) Btullis: [C: +1] "All good from my point of view. Thanks volans." 
[puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: Volans) [15:21:26] grmbl, I can’t get the text of those pages with getText.php because there’s no way to refer to them [15:21:29] even if I specify the revision [15:21:49] aha, but https://en.wikiquote.org/w/index.php?oldid=2910714 works! [15:22:02] and https://en.wikiquote.org/w/index.php?oldid=2906032 is the other one [15:22:06] both are just redirects [15:22:09] (CR) Jcrespo: "I don't have any feedback for this except -let's test that it works well after deployment to ensure both the script and bacula can recover" [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: Volans) [15:22:49] (CR) Bking: [C: +2] cloudelastic: Add migration canary to cloudelastic cluster [puppet] - https://gerrit.wikimedia.org/r/993764 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [15:25:04] (CR) Volans: "Sure! This should be safe to merge and we can use netbox-dev2002 as test host for the bacula side of things." [puppet] - https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: Volans) [15:25:11] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2366.codfw.wmnet with OS bullseye [15:25:48] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2368.codfw.wmnet with OS bullseye [15:26:04] posted on the task about it [15:26:10] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw2370.codfw.wmnet with OS bullseye [15:26:11] anyway, I think that’s it as far as deployment goes [15:26:16] !log UTC afternoon backport+config window done [15:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:30] and now the maintenance script for MatmaRex ^^ [15:26:39] only on enwiki, I take it? 
[15:27:49] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1440.eqiad.wmnet with OS bullseye [15:28:01] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1440.eqiad.wmnet with OS bullseye [15:28:27] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1457.eqiad.wmnet with OS bullseye [15:28:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1482.eqiad.wmnet with OS bullseye [15:28:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1459.eqiad.wmnet with OS bullseye [15:28:39] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1457.eqiad.wmnet with OS bullseye [15:28:41] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1466.eqiad.wmnet with OS bullseye [15:28:45] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1482.eqiad.wmnet with OS bullseye [15:28:49] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1459.eqiad.wmnet with OS bullseye [15:28:53] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1466.eqiad.wmnet with OS bullseye [15:29:11] !log 
START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript CheckSignatures enwiki | tee T356168 [15:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:16] T356168: Add a way to generate a list of users with an invalid signatures - https://phabricator.wikimedia.org/T356168 [15:29:25] that’s a considerable amount of output already [15:29:38] like, a dozen or two per second [15:29:46] Yep Lucas_WMDE This is the issue! They just have to delete the redirects :) [15:29:49] (rough estimate) [15:29:53] Btw thanks for your assistance! :D [15:29:57] np :) [15:30:37] (03Abandoned) 10Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/937139 (owner: 10Ssingh) [15:33:31] MatmaRex: fyi, the maint script is currently at user_id ~1M out of 47M, and has found ~10k users so far [15:33:35] 10SRE: operations/deployment-charts can't be checked out on Windows due to a file named aux.yaml - https://phabricator.wikimedia.org/T356185 (10jhathaway) a:03jhathaway [15:33:53] so if the rate keeps up that’s a very rough estimate of half a million users with invalid signatures [15:34:03] that’s a lot of pings ._. 
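(Editorial note: the "half a million" figure above is a straight linear extrapolation from the numbers quoted in the chat, roughly 10k hits in the first ~1M of ~47M user IDs. A minimal sketch of that arithmetic follows. The function name is hypothetical and not part of CheckSignatures, and it assumes the hit rate stays constant across the whole user_id range, which the log only describes as a very rough estimate.)

```python
def extrapolate_invalid(scanned, found, total):
    """Linear extrapolation: assume the hit rate seen so far holds for all users.

    Hypothetical helper for illustration only; not part of any MediaWiki script.
    """
    rate = found / scanned
    return rate, rate * total


# Approximate figures quoted in the log at 15:33:31.
rate, estimate = extrapolate_invalid(scanned=1_000_000, found=10_000, total=47_000_000)
print(f"rate ~ {rate:.1%}, projected ~ {estimate:,.0f} users")
```

With those inputs the projection is about 470,000 users at a rate of about 1%, which is what gets rounded to "half a million" in the chat.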
[15:34:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:34:54] ^ that's me [15:40:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [15:41:01] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [15:41:36] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1459.eqiad.wmnet with reason: host reimage [15:41:37] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage [15:41:57] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2368.codfw.wmnet with reason: host reimage [15:41:58] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1440.eqiad.wmnet with reason: host reimage [15:42:02] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2366.codfw.wmnet with reason: host reimage [15:42:07] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1466.eqiad.wmnet with reason: host reimage [15:42:08] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1457.eqiad.wmnet with reason: host reimage [15:42:32] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2370.codfw.wmnet with reason: host reimage [15:44:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1459.eqiad.wmnet with reason: host reimage [15:45:42] the ~1% rate of invalid signatures seems to roughly hold up so far fwiw [15:46:55] (03CR) 
10CDanis: [C: 03+1] oauth2-proxy: run as nobody or explicit uid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994182 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [15:47:12] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2368.codfw.wmnet with reason: host reimage [15:47:14] (03CR) 10C. Scott Ananian: [C: 03+1] mobileapps: Switchover PCS to core page HTML [deployment-charts] - 10https://gerrit.wikimedia.org/r/994177 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [15:47:29] (03Abandoned) 10Jgiannelos: mobileapps: Use core /page/html output in all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/992975 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [15:47:40] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:48:00] (03PS7) 10Arnaudb: admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [15:48:30] (it also looks like the maintenance script is slowly using more and more RAM, but it currently looks like it’ll finish well before mwmaint2002 runs out of memory) [15:50:00] 10SRE, 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T356146 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated cable on both ends. came up. 
[15:50:02] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:50:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1482.eqiad.wmnet with reason: host reimage [15:52:14] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:52:26] ^^ cert alerts are (kinda) expected, will silence [15:53:14] ACKNOWLEDGEMENT - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) Brian_King T355617 https://wikitech.wikimedia.org/wiki/Search [15:53:14] ACKNOWLEDGEMENT - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) Brian_King T355617 https://wikitech.wikimedia.org/wiki/Search [15:53:14] ACKNOWLEDGEMENT - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) Brian_King T355617 https://wikitech.wikimedia.org/wiki/Search [15:53:14] ACKNOWLEDGEMENT - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) Brian_King T355617 https://wikitech.wikimedia.org/wiki/Search [15:53:19] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
mw2370.codfw.wmnet with reason: host reimage [15:54:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [15:54:42] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:54:45] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [15:54:52] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:56:18] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1440.eqiad.wmnet with reason: host reimage [15:57:20] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:57:28] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:58:08] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudelastic1010.eqiad.wmnet with reason: T355617 [15:58:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudelastic1010.eqiad.wmnet with reason: T355617 [15:59:09] Lucas_WMDE: hmm. maybe i should have limited it to active users [15:59:22] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2366.codfw.wmnet with reason: host reimage [15:59:32] or maybe there's some bug i missed, that seems like a lot of invalid signatures [16:00:02] I can send you a random subset if you like [16:00:05] eoghan, jelto, and arnoldokoth: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1600). [16:00:17] and actually, a lot of signatures in general. i didn't think that many users customize them [16:00:20] (currently at 68k lines) [16:00:58] one random one I picked out has 0 contributions on enwiki o_O [16:01:53] looking at user_properties, they have fancysig=1, but nothing that looks like an actual signature [16:02:17] ah, “nickname” is the signature [16:02:35] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1457.eqiad.wmnet with reason: host reimage [16:03:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1459.eqiad.wmnet with OS bullseye [16:03:24] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1459.eqiad.wmnet with OS bullseye completed: - mw1459 (**PASS**) - Downtimed on Icinga/Alertma... 
[16:03:49] (03PS1) 10Ebernhardson: Run CheckerJob against read-only clusters [extensions/CirrusSearch] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994142 (https://phabricator.wikimedia.org/T354793) [16:04:06] (03PS1) 10Ebernhardson: Run CheckerJob against read-only clusters [extensions/CirrusSearch] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994143 (https://phabricator.wikimedia.org/T354793) [16:05:39] (03CR) 10Klausman: [C: 03+1] admin_ng: elevate ml users experimental permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/994117 (https://phabricator.wikimedia.org/T354516) (owner: 10Ilias Sarantopoulos) [16:05:54] (03PS1) 10Alexandros Kosiaris: mw-debug: Enable tracing with 100% sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/994193 (https://phabricator.wikimedia.org/T351566) [16:06:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1466.eqiad.wmnet with reason: host reimage [16:06:26] (the random user I picked out did in fact have an invalid signature, FWIW. script continues to run for now) [16:08:14] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2368.codfw.wmnet with OS bullseye [16:09:16] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1482.eqiad.wmnet with OS bullseye [16:09:25] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1482.eqiad.wmnet with OS bullseye completed: - mw1482 (**PASS**) - Downtimed on Icinga/Alertma... 
[16:11:22] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [16:13:12] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2370.codfw.wmnet with OS bullseye [16:13:32] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1008.wikimedia.org [16:14:40] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [16:14:45] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [16:14:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1440.eqiad.wmnet with OS bullseye [16:15:07] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1440.eqiad.wmnet with OS bullseye completed: - mw1440 (**PASS**) - Downtimed on Icinga/Alertma... [16:18:48] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2366.codfw.wmnet with OS bullseye [16:21:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1457.eqiad.wmnet with OS bullseye [16:21:27] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1457.eqiad.wmnet with OS bullseye completed: - mw1457 (**PASS**) - Downtimed on Icinga/Alertma... 
[16:25:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1466.eqiad.wmnet with OS bullseye [16:25:23] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1466.eqiad.wmnet with OS bullseye completed: - mw1466 (**PASS**) - Downtimed on Icinga/Alertma... [16:27:09] (03CR) 10Ssingh: "Thanks for the patch! I guess looking at check_timedatectl and making sure that node_timex_offset_seconds does everything that Perl script" [puppet] - 10https://gerrit.wikimedia.org/r/994172 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [16:28:36] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab2002.wikimedia.org with reason: server move [16:28:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on gitlab2002.wikimedia.org with reason: server move [16:29:02] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:10] (03CR) 10Jdlrobson: "Timing corresponds with this UBN error: https://phabricator.wikimedia.org/T356193" [extensions/Popups] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994028 (https://phabricator.wikimedia.org/T355933) (owner: 10WMDE-Fisch) [16:29:25] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on cloudelastic1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:34] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on gitlab.wikimedia.org with reason: server move [16:29:48] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 
on gitlab.wikimedia.org with reason: server move [16:30:28] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:05] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Bump memory limit by 200Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/994196 (https://phabricator.wikimedia.org/T266216) [16:32:07] (03PS1) 10Alexandros Kosiaris: rdf-streaming-updated: Bump taskmanager memory limit by ~33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/994197 (https://phabricator.wikimedia.org/T266216) [16:34:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [16:34:05] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [16:35:06] (03Abandoned) 10Alexandros Kosiaris: icu67: Setup shellbox-icu67 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955772 (https://phabricator.wikimedia.org/T329491) (owner: 10Alexandros Kosiaris) [16:36:19] (03CR) 10Jdlrobson: "This https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/994198" [extensions/Popups] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994028 (https://phabricator.wikimedia.org/T355933) (owner: 10WMDE-Fisch) [16:37:35] (03PS1) 10Jgiannelos: mobileapps: Add missing template for MW parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/994199 [16:38:17] (03PS2) 10Jgiannelos: mobileapps: Add missing template for MW parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/994199 (https://phabricator.wikimedia.org/T339865) [16:40:59] (03CR) 10CDanis: [C: 03+1] mw-debug: Enable tracing with 100% sampling [deployment-charts] - 10https://gerrit.wikimedia.org/r/994193 
(https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [16:41:46] (03CR) 10Majavah: [C: 03+2] P:toolforge: mailrelay: workaround Exim 4.94 taints [puppet] - 10https://gerrit.wikimedia.org/r/993693 (https://phabricator.wikimedia.org/T311910) (owner: 10Majavah) [16:43:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:52] (03CR) 10Hnowlan: [C: 03+1] mobileapps: Add missing template for MW parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/994199 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [16:44:05] !log gitlab is down for maintenance for a few minutes [16:44:08] (03PS3) 10Majavah: exim4: Allow installing from component [puppet] - 10https://gerrit.wikimedia.org/r/994166 (https://phabricator.wikimedia.org/T356171) [16:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:10] (03PS3) 10Majavah: P:toolforge::mailrelay: ARC sign outbound mail [puppet] - 10https://gerrit.wikimedia.org/r/994167 (https://phabricator.wikimedia.org/T356171) [16:44:44] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Add missing template for MW parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/994199 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [16:45:36] (03Merged) 10jenkins-bot: mobileapps: Add missing template for MW parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/994199 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [16:45:40] (ProbeDown) firing: (4) Service debmonitor2002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:22] !log jgiannelos@deploy2002 helmfile [staging] START 
helmfile.d/services/mobileapps: apply [16:47:27] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:47:40] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:48:42] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:49:16] !log gitlab is back [16:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:39] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [16:50:16] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) We did the last server move today. Thanks for All [16:51:34] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) 05Open→03Resolved a:03Papaul [16:54:30] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: sync [16:54:46] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [16:54:55] !log Running homer 'cr*codfw*' commit 'T351074' [16:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:00] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:56:14] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1007.wikimedia.org [16:56:25] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1008.wikimedia.org [16:56:33] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1009.wikimedia.org [16:56:40] (03Abandoned) 10Lucas Werkmeister (WMDE): Log more information on LexemePatcher errors [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/993503 (https://phabricator.wikimedia.org/T284061) 
(owner: 10Lucas Werkmeister (WMDE)) [16:56:49] (03PS1) 10Jforrester: Do not search for elements if no previews have been registered [extensions/Popups] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994202 (https://phabricator.wikimedia.org/T355933) [16:57:03] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [16:57:08] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [16:57:44] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:16] (03PS1) 10Jforrester: Do not search for elements if no previews have been registered [extensions/Popups] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994203 (https://phabricator.wikimedia.org/T355933) [16:59:25] (SystemdUnitFailed) resolved: elasticsearch-disable-readahead.service Failed on cloudelastic1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:29] (03CR) 10Jdlrobson: [C: 03+1] Do not search for elements if no previews have been registered [extensions/Popups] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994202 (https://phabricator.wikimedia.org/T355933) (owner: 10Jforrester) [17:00:04] jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:58] I may sling out a UBN fix. 
[17:04:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/Popups] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994202 (https://phabricator.wikimedia.org/T355933) (owner: 10Jforrester) [17:04:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/Popups] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994203 (https://phabricator.wikimedia.org/T355933) (owner: 10Jforrester) [17:05:39] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Dzahn) >>! In T355437#9498404, @Count_Count wrote: > The banner says 8:30 to 8:40 UTC. I was confused that it was still there. Yea, the start got delayed a little bit with things like... [17:08:58] (03Merged) 10jenkins-bot: Do not search for elements if no previews have been registered [extensions/Popups] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994202 (https://phabricator.wikimedia.org/T355933) (owner: 10Jforrester) [17:10:04] (03Merged) 10jenkins-bot: Do not search for elements if no previews have been registered [extensions/Popups] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994203 (https://phabricator.wikimedia.org/T355933) (owner: 10Jforrester) [17:10:29] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:994202|Do not search for elements if no previews have been registered (T355933 T356186 T356193)]], [[gerrit:994203|Do not search for elements if no previews have been registered (T355933 T356186 T356193)]] [17:10:37] T355933: Reference Previews not showing when users disable page previews in settings - https://phabricator.wikimedia.org/T355933 [17:10:38] T356186: "Uncaught DOMException: Element.closest: '' is not a valid selector" when Page Previews is disabled but Navigation Popups gadget is enabled - https://phabricator.wikimedia.org/T356186 [17:10:38] T356193: SyntaxError: Failed to execute 
'closest' on 'Element': The provided selector is empty. - https://phabricator.wikimedia.org/T356193 [17:11:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "Let's try!" [puppet] - 10https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [17:11:45] MatmaRex: ETA for the maintenance script about three more hours btw [17:12:13] I’ll try to still be online and !log around the right time when it finishes [17:13:47] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2005.codfw.wmnet with OS bookworm [17:14:34] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:994202|Do not search for elements if no previews have been registered (T355933 T356186 T356193)]], [[gerrit:994203|Do not search for elements if no previews have been registered (T355933 T356186 T356193)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:15:38] !log jforrester@deploy2002 jforrester: Continuing with sync [17:16:53] (03PS1) 10Jgiannelos: mobileapps: Enable trace logs for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) [17:21:08] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [17:22:21] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:994202|Do not search for elements if no previews have been registered (T355933 T356186 T356193)]], [[gerrit:994203|Do not search for elements if no previews have been registered (T355933 T356186 T356193)]] (duration: 11m 51s) [17:22:29] T355933: Reference Previews not showing when users disable page previews in settings - https://phabricator.wikimedia.org/T355933 [17:22:29] T356186: "Uncaught DOMException: Element.closest: '' is not a valid selector" when Page Previews is disabled but Navigation Popups gadget is enabled - https://phabricator.wikimedia.org/T356186 [17:22:29] T356193: SyntaxError: Failed to 
execute 'closest' on 'Element': The provided selector is empty. - https://phabricator.wikimedia.org/T356193 [17:26:37] (03PS4) 10Majavah: P:toolforge::mailrelay: ARC sign outbound mail [puppet] - 10https://gerrit.wikimedia.org/r/994167 (https://phabricator.wikimedia.org/T356171) [17:27:57] (03CR) 10Hnowlan: mobileapps: Enable trace logs for debugging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [17:31:23] (03PS2) 10Jgiannelos: mobileapps: Enable trace logs for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) [17:32:23] (03CR) 10Jgiannelos: mobileapps: Enable trace logs for debugging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [17:32:59] (03PS3) 10Jgiannelos: mobileapps: Enable trace logs for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) [17:35:07] Lucas_WMDE: thanks [17:37:06] !log DROP test_spark3_loading keyspace, Generated Data (Cassandra) cluster — T356112 [17:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:11] T356112: Generated Data Platform (neé AQS): remove (unused/uneeded) test_spark3_loading keyspace - https://phabricator.wikimedia.org/T356112 [17:37:49] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate new elastic config - bking@cumin2002 - T355617 [17:37:54] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [17:38:57] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:48:21] * Lucas_WMDE still running the maintenance script but mostly afk, if anything’s wrong don’t hesitate to kill it without waiting for me to respond [17:48:41] (“the maintenance script” being CheckSignatures, on mwmaint2002, for T356168 ^^) [17:48:42] T356168: Add a way to generate a list of users with an invalid signatures - https://phabricator.wikimedia.org/T356168 [17:51:46] (03PS1) 10Superpes15: [ukwiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994211 (https://phabricator.wikimedia.org/T355972) [17:56:11] (03CR) 10Hnowlan: [C: 03+1] mobileapps: Enable trace logs for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1800) [18:00:30] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Enable trace logs for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [18:01:34] (03Merged) 10jenkins-bot: mobileapps: Enable trace logs for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994209 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [18:02:12] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:02:16] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:02:24] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:02:28] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:02:39] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:02:43] !log 
jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:03:03] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:03:07] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:03:07] (03PS1) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) [18:03:21] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:03:25] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:04:06] (03CR) 10CI reject: [V: 04-1] Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [18:04:34] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:05:15] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) 05In progress→03Resolved The change has been made permanent and the entire "dumps related" section is now gone from the legacy exim aliases controlled... 
[18:05:22] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:05:39] (03PS2) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) [18:07:06] (03PS1) 10Superpes15: [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) [18:07:49] (03CR) 10CI reject: [V: 04-1] [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) (owner: 10Superpes15) [18:08:31] (03PS2) 10Superpes15: [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) [18:09:15] (03PS3) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) [18:10:13] (03PS4) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) [18:10:51] (03CR) 10Dzahn: [C: 03+1] P:url_downloader absent Icinga check. 
[puppet] - 10https://gerrit.wikimedia.org/r/994170 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [18:11:05] (03PS1) 10Jgiannelos: mobileapps: Fix MW core request template key [deployment-charts] - 10https://gerrit.wikimedia.org/r/994215 [18:11:46] (03PS2) 10Jgiannelos: mobileapps: Fix MW core request template name [deployment-charts] - 10https://gerrit.wikimedia.org/r/994215 (https://phabricator.wikimedia.org/T339865) [18:13:32] (03CR) 10Dzahn: [C: 03+2] Enable Puppet for Phabricator on the role level [puppet] - 10https://gerrit.wikimedia.org/r/994148 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [18:16:22] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [18:17:16] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [18:27:55] (03CR) 10Subramanya Sastry: Configure parser cache filters for parsoid-pcache (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [18:29:16] (03CR) 10Subramanya Sastry: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [18:29:39] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Rmaung) Just tagging @Eevans and @BBlack as I believe you are the Clinic duty SREs? 
[18:29:46] 10SRE, 10SRE-Access-Requests: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10MarcoAurelio) [18:29:51] (03PS1) 10Superpes15: [ganwiki] Add new namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994220 (https://phabricator.wikimedia.org/T355854) [18:31:09] 10SRE, 10LDAP-Access-Requests: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10MarcoAurelio) [18:31:31] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Rmaung) Oops, or @ABran-WMF [18:33:22] 10SRE, 10LDAP-Access-Requests: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10Superpes15) :( [18:42:09] 10SRE, 10LDAP-Access-Requests: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10Dzahn) @MarcoAurelio We are sad to see you go. Let's start by telling Katie about it I guess so she can keep the records updated. @KFrancis https://meta.wikimedia.org/wiki/User:KFrancis_(... [18:44:04] (03CR) 10Ahmon Dancy: foreachwikiindblist: Return early when no arg is passed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992263 (owner: 10Zabe) [18:46:09] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@ccaa5dc]: (no justification provided) [18:46:15] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@ccaa5dc]: (no justification provided) (duration: 00m 05s) [18:46:36] (03CR) 10Ahmon Dancy: foreachwikiindblist: Return early when no arg is passed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992263 (owner: 10Zabe) [18:52:15] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:52:26] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[19:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T1900) [19:00:27] o/ [19:01:52] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994221 (https://phabricator.wikimedia.org/T354434) [19:01:54] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994221 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:02:56] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994221 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:09:01] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [19:10:34] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.16 refs T354434 [19:10:39] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [19:11:13] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10xcollazo) For completeness: Group at: https://groups.google.com/a/wikimedia.org/g/ops-dumps Successful forward of ops-dumps to data-engineering-alerts@lists.wi... 
[19:11:21] (03PS1) 10Ayounsi: DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) [19:12:37] (03CR) 10CI reject: [V: 04-1] DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [19:13:22] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) Thanks for adding that @xcollazo and letting us move this out of the exim aliases. This was like the last group alias left to move from an epic 2015 (!)... [19:13:53] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) [19:13:55] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) [19:19:45] (03PS1) 10Jdlrobson: Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) [19:20:27] (03CR) 10CI reject: [V: 04-1] Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) (owner: 10Jdlrobson) [19:22:55] (03PS1) 10Eevans: cassandra: cassandra roles for druid-based aqs endpoints [puppet] - 10https://gerrit.wikimedia.org/r/994225 (https://phabricator.wikimedia.org/T355917) [19:23:40] (03PS2) 10Eevans: cassandra: cassandra roles for druid-based aqs endpoints [puppet] - 10https://gerrit.wikimedia.org/r/994225 (https://phabricator.wikimedia.org/T352948) [19:27:44] !log FINISHED lucaswerkmeister-wmde@mwmaint2002:~$ mwscript CheckSignatures enwiki | tee T356168 # -- 268378 invalid signatures -- [19:27:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:49] T356168: Add a way to generate a list of users with an invalid signatures - https://phabricator.wikimedia.org/T356168 [19:28:10] (03PS1) 10Ayounsi: DNS: add includes for private1-virtual-codfw DNS PTRs [dns] - 10https://gerrit.wikimedia.org/r/994246 (https://phabricator.wikimedia.org/T300152) [19:31:07] (03CR) 10Ssingh: [C: 03+1] DNS: add includes for private1-virtual-codfw DNS PTRs [dns] - 10https://gerrit.wikimedia.org/r/994246 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [19:33:28] (03CR) 10Andrew Bogott: [C: 03+1] wmcs-image-create: remove cloud-init-finished flag if present [puppet] - 10https://gerrit.wikimedia.org/r/992677 (owner: 10Majavah) [19:35:56] (03PS2) 10Ayounsi: DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) [19:36:27] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1010.eqiad.wmnet for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [19:36:27] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cloudelastic1010.eqiad.wmnet for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [19:36:32] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [19:36:42] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [19:36:44] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [19:37:16] (03CR) 10CI reject: [V: 04-1] DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [19:40:42] (03PS3) 10Ayounsi: DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) [19:42:00] 
(03CR) 10CI reject: [V: 04-1] DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [19:45:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: wdqs::public [19:46:20] 10SRE, 10LDAP-Access-Requests: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10KFrancis) I've updated my list. Thanks!!! [19:46:34] (03PS1) 10Muehlenhoff: Switch wdqs/public to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/994249 (https://phabricator.wikimedia.org/T349619) [19:50:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch wdqs/public to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/994249 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [19:52:37] (03PS1) 10Majavah: P:toolforge::mailrelay: don't blindly reject any bounces [puppet] - 10https://gerrit.wikimedia.org/r/994250 [19:55:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994166 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [19:59:16] (03CR) 10Majavah: [C: 03+2] exim4: Allow installing from component [puppet] - 10https://gerrit.wikimedia.org/r/994166 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [20:01:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wdqs::public [20:01:44] (03PS1) 10Bking: cloudelastic: listen on "site" scoped IPs as well [puppet] - 10https://gerrit.wikimedia.org/r/994251 (https://phabricator.wikimedia.org/T355617) [20:02:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994251 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:02:26] (03PS2) 10Ryan Kemper: cloudelastic: listen on "site" scoped IPs as well [puppet] - 10https://gerrit.wikimedia.org/r/994251 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:02:30] (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: 
listen on "site" scoped IPs as well [puppet] - 10https://gerrit.wikimedia.org/r/994251 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:03:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [20:05:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994251 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:07:41] (03PS1) 10Dzahn: admin: absent maurelio from ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/994252 (https://phabricator.wikimedia.org/T356203) [20:07:43] (03CR) 10Bking: [C: 03+2] cloudelastic: listen on "site" scoped IPs as well [puppet] - 10https://gerrit.wikimedia.org/r/994251 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:09:16] (03CR) 10CI reject: [V: 04-1] admin: absent maurelio from ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/994252 (https://phabricator.wikimedia.org/T356203) (owner: 10Dzahn) [20:10:24] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10Dzahn) @ABran-WMF I uploaded a patch. Wanna review and take on the rest as clinic duty? 
[20:12:36] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Dzahn) [20:12:58] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: green, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 798, active_shards: 1598, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 112, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:13:14] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: green, timed_out: False, number_of_nodes: 10, number_of_data_nodes: 10, active_primary_shards: 764, active_shards: 1531, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:16:40] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Dzahn) for clinic duty: this ticket mixes an LDAP access request (wmf) and a shell access request (analytics-privatedata-users), which are different types of groups and...
[20:16:45] (03PS3) 10Eevans: sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402) [20:16:47] (03PS3) 10Eevans: sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402) [20:16:49] (03PS3) 10Eevans: sessionstore: provision sessionstore1006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402) [20:16:51] (03PS3) 10Eevans: sessionstore: configure new hosts to reuse /srv [puppet] - 10https://gerrit.wikimedia.org/r/989631 (https://phabricator.wikimedia.org/T353402) [20:20:41] (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [20:21:34] (03PS1) 10Superpes15: [eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994254 (https://phabricator.wikimedia.org/T355033) [20:24:35] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Rmaung) Thank you @Dzahn -- this would be a shell account with Kerberos. 
[20:27:24] (03PS2) 10Superpes15: [eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994254 (https://phabricator.wikimedia.org/T355033) [20:27:46] (03PS3) 10Superpes15: [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) [20:33:13] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install X - https://phabricator.wikimedia.org/T356216 (10RobH) [20:33:40] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216 (10RobH) a:03Andrew [20:34:33] 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216 (10RobH) @andrew: I've assigned this task to you for you to populate the racking details, additionally please add the servers to the site.pp file with the insetup... 
[20:35:08] !log bootstrapping sessionstore1004/cassandra-a — T353402 [20:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:23] T353402: Provision new sessionstore (eqiad) cluster nodes: sessionstore[4-6] - https://phabricator.wikimedia.org/T353402 [20:38:34] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore1004.eqiad.wmnet with reason: Commissioning — T353402 [20:38:49] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore1004.eqiad.wmnet with reason: Commissioning — T353402 [20:40:32] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10jhathaway) After reading this post on [[ https://www.reddit.com/r/sysadmin/comments/q40ure/comment/hf... [20:45:40] (ProbeDown) firing: (4) Service debmonitor2002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:51:22] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate first private IP host config - bking@cumin2002 - T355617 [20:51:29] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [20:52:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate first private IP host config - bking@cumin2002 - T355617 [20:53:22] (03CR) 10Volans: "I'm wondering if this might have side effects in some cases and I'm wondering if we should instead set the 
host-name or fqdn option in the" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [20:53:36] (03PS1) 10Dzahn: add google-site-verification to gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) [20:54:46] (03PS2) 10Jdlrobson: Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181) [20:56:38] (03CR) 10Dzahn: "this is from the "verify ownership" dialog on https://postmaster.google.com/u/2/managedomains" [dns] - 10https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) (owner: 10Dzahn) [20:57:08] (03PS1) 10Jforrester: build: Upgrade phpunit to 9.6.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994317 (https://phabricator.wikimedia.org/T342110) [20:59:35] (03PS1) 10Bking: cloudelastic: bind to all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/994321 (https://phabricator.wikimedia.org/T355617) [20:59:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994321 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240130T2100). [21:00:05] ebernhardson and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:30] (03CR) 10Reedy: [C: 03+2] build: Upgrade phpunit to 9.6.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994317 (https://phabricator.wikimedia.org/T342110) (owner: 10Jforrester) [21:00:33] Hi :) [21:00:44] \o [21:01:24] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for sessionstore1004.eqiad.wmnet [21:01:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sessionstore1004.eqiad.wmnet [21:02:18] (03Merged) 10jenkins-bot: build: Upgrade phpunit to 9.6.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994317 (https://phabricator.wikimedia.org/T342110) (owner: 10Jforrester) [21:05:35] hi - i can deploy - pardon lateness [21:05:57] ebernhardson: do you want to self-deploy or want me to do it? [21:06:19] :) Np I'll be here in 5-10 minutes so you can process other patches :D [21:06:29] cjming: you can deploy [21:06:48] sounds good [21:07:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994142 (https://phabricator.wikimedia.org/T354793) (owner: 10Ebernhardson) [21:08:02] * urbanecm waves as well [21:08:07] and sees cjming is deploying [21:08:34] hi urbanecm: thanks for checking in [21:13:02] (03CR) 10Clare Ming: [C: 03+2] Run CheckerJob against read-only clusters [extensions/CirrusSearch] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994143 (https://phabricator.wikimedia.org/T354793) (owner: 10Ebernhardson) [21:14:25] (03CR) 10Bking: "self-merging, as this is breaking the service and I've confirmed it works by one-offing cloudelastic1003."
[puppet] - 10https://gerrit.wikimedia.org/r/994321 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:14:27] (03CR) 10Bking: [C: 03+2] cloudelastic: bind to all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/994321 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:21:41] I’m here btw [21:24:01] hi Superpes: CI is taking a while on the first patches in queue -- it'll probably be another 10ish minutes before we get to your patches [21:24:11] Yep wonderful [21:26:17] ebernhardson: are your backports testable? should i just sync when ready? [21:27:53] cjming: the only change is in the job queue, sync when it goes through [21:28:00] (03Merged) 10jenkins-bot: Run CheckerJob against read-only clusters [extensions/CirrusSearch] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994142 (https://phabricator.wikimedia.org/T354793) (owner: 10Ebernhardson) [21:28:53] !log cjming@deploy2002 Started scap: Backport for [[gerrit:994142|Run CheckerJob against read-only clusters (T354793)]] [21:28:58] T354793: SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs - https://phabricator.wikimedia.org/T354793 [21:30:22] !log cjming@deploy2002 ebernhardson and cjming: Backport for [[gerrit:994142|Run CheckerJob against read-only clusters (T354793)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:28] !log cjming@deploy2002 ebernhardson and cjming: Continuing with sync [21:32:22] (03CR) 10Urbanecm: [C: 03+1] "if we're getting closer to the limit, then this makes sense. however, i am a little curious what takes almost all of the 650Mi allocated.
b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/994196 (https://phabricator.wikimedia.org/T266216) (owner: 10Alexandros Kosiaris) [21:33:14] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Dzahn) ACK, thanks for confirming that @Rmaung [21:33:50] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10Dzahn) a:03WMDECyn [21:34:06] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10Dzahn) 05Open→03In progress [21:34:32] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate first private IP host config - bking@cumin2002 - T355617 [21:34:43] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [21:34:43] (03Merged) 10jenkins-bot: Run CheckerJob against read-only clusters [extensions/CirrusSearch] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994143 (https://phabricator.wikimedia.org/T354793) (owner: 10Ebernhardson) [21:36:42] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994142|Run CheckerJob against read-only clusters (T354793)]] (duration: 07m 49s) [21:36:48] T354793: SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs - https://phabricator.wikimedia.org/T354793 [21:37:10] !log cjming@deploy2002 Started scap: Backport for [[gerrit:994143|Run CheckerJob against read-only clusters (T354793)]] [21:38:37] !log cjming@deploy2002 cjming and ebernhardson: Backport for [[gerrit:994143|Run CheckerJob against read-only clusters (T354793)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:38:47] !log cjming@deploy2002 cjming and ebernhardson: Continuing with sync [21:38:57] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB 
of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [21:39:25] (03PS1) 10JHathaway: donate: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994331 (https://phabricator.wikimedia.org/T356221) [21:39:44] (03PS2) 10Clare Ming: [ukwiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994211 (https://phabricator.wikimedia.org/T355972) (owner: 10Superpes15) [21:40:17] cjming You can also deploy 2-3 of them together :D [21:40:17] (03CR) 10CI reject: [V: 04-1] donate: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994331 (https://phabricator.wikimedia.org/T356221) (owner: 10JHathaway) [21:40:26] !log LDAP - added brennen to group releng (T356043) - already done/approved in the past in T215365 [21:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:33] T356043: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 [21:40:33] T215365: LDAP requests for Brennen Bearnes: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T215365 [21:41:02] (03PS2) 10JHathaway: donate: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994331 (https://phabricator.wikimedia.org/T356221) [21:41:24] !log LDAP - added jhuneidi to group releng (T356043) - already done/approved in the past in T210028 [21:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:30] T210028: LDAP requests for Jeena Huneidi - https://phabricator.wikimedia.org/T210028 [21:41:43] (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [21:42:12] !log LDAP 
- added jnuche to group releng (T356043) - already done/approved in the past in T301149 [21:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:19] T301149: Grant Access to wmf, releng, ciadmin for jnuche - https://phabricator.wikimedia.org/T301149 [21:43:22] Superpes: ya - will do [21:43:28] 10SRE, 10LDAP-Access-Requests, 10LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (10Dzahn) [21:44:27] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10bcampbell) @jhathaway All good, thanks for the breakdown. I also CCd you in the support interaction wi... [21:44:34] (03CR) 10Clare Ming: [C: 03+2] [ukwiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994211 (https://phabricator.wikimedia.org/T355972) (owner: 10Superpes15) [21:44:51] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994143|Run CheckerJob against read-only clusters (T354793)]] (duration: 07m 41s) [21:44:57] T354793: SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs - https://phabricator.wikimedia.org/T354793 [21:45:05] ebernhardson: your patches should be live! 
[21:45:17] (03Merged) 10jenkins-bot: [ukwiki] Change autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994211 (https://phabricator.wikimedia.org/T355972) (owner: 10Superpes15) [21:45:29] (03PS4) 10Clare Ming: [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) (owner: 10Superpes15) [21:46:32] (03CR) 10Clare Ming: [C: 03+2] [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) (owner: 10Superpes15) [21:46:37] cjming: thanks! [21:46:55] np! [21:47:13] (03Merged) 10jenkins-bot: [ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994214 (https://phabricator.wikimedia.org/T354850) (owner: 10Superpes15) [21:47:24] (03PS2) 10Clare Ming: [ganwiki] Add new namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994220 (https://phabricator.wikimedia.org/T355854) (owner: 10Superpes15) [21:47:46] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10KFrancis) The NDA has been processed and out for signatures. I'll confirm when it's complete. 
[21:48:16] (CR) Clare Ming: [C: +2] [ganwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/994220 (https://phabricator.wikimedia.org/T355854) (owner: Superpes15)
[21:48:58] (Merged) jenkins-bot: [ganwiki] Add new namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/994220 (https://phabricator.wikimedia.org/T355854) (owner: Superpes15)
[21:49:49] !log cjming@deploy2002 Started scap: Backport for [[gerrit:994211|[ukwiki] Change autoconfirmed setting (T355972)]], [[gerrit:994214|[ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods (T354850)]], [[gerrit:994220|[ganwiki] Add new namespace aliases (T355854)]]
[21:49:58] T355972: Additional requirements to obtain autoconfirmed rights in ukwiki - https://phabricator.wikimedia.org/T355972
[21:49:59] T354850: Make some config changes for transwiki usergroup in gan.wikipedia - https://phabricator.wikimedia.org/T354850
[21:49:59] T355854: Add some namespace alias for gan.wikipedia - https://phabricator.wikimedia.org/T355854
[21:50:13] (PS3) Clare Ming: [eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl [mediawiki-config] - https://gerrit.wikimedia.org/r/994254 (https://phabricator.wikimedia.org/T355033) (owner: Superpes15)
[21:50:27] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore1005.eqiad.wmnet with reason: Bootstrapping — T353402
[21:50:32] T353402: Provision new sessionstore (eqiad) cluster nodes: sessionstore[4-6] - https://phabricator.wikimedia.org/T353402
[21:50:43] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore1005.eqiad.wmnet with reason: Bootstrapping — T353402
[21:51:26] !log cjming@deploy2002 superpes and cjming: Backport for [[gerrit:994211|[ukwiki] Change autoconfirmed setting (T355972)]], [[gerrit:994214|[ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods (T354850)]], [[gerrit:994220|[ganwiki] Add new namespace aliases (T355854)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:51:37] Testing :)
[21:52:05] cool - I scap'd your first 3 patches together
[21:52:59] They works thanks :D cjming
[21:53:07] *Work
[21:53:12] great - syncing
[21:53:16] !log cjming@deploy2002 superpes and cjming: Continuing with sync
[21:54:17] (PS6) Bking: cloudelastic: enable DNS discovery/VIP for test service [puppet] - https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617)
[21:54:33] (Abandoned) Bking: cloudelastic: enable DNS discovery/VIP for test service [puppet] - https://gerrit.wikimedia.org/r/992748 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[21:54:47] (Abandoned) Bking: cloudelastic: add CNAME for migration canary [dns] - https://gerrit.wikimedia.org/r/993014 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[21:56:42] You probably should run namespacedupes.php after deploying cjming
[21:57:33] (CR) JHathaway: [C: +1] "looks good, perhaps add a comment as to its use." [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) (owner: Dzahn)
[21:57:38] yup - i'll do that for both ganwiki + eswiki after they're all live
[21:58:31] Yep thanks! Just a reminder :D
[21:59:03] appreciate the reminder - i often forget
[21:59:22] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994211|[ukwiki] Change autoconfirmed setting (T355972)]], [[gerrit:994214|[ganwiki] Add 'suppressredirect' to transwiki usergroup and change assignment and revocation methods (T354850)]], [[gerrit:994220|[ganwiki] Add new namespace aliases (T355854)]] (duration: 09m 32s)
[21:59:33] T355972: Additional requirements to obtain autoconfirmed rights in ukwiki - https://phabricator.wikimedia.org/T355972
[21:59:33] T354850: Make some config changes for transwiki usergroup in gan.wikipedia - https://phabricator.wikimedia.org/T354850
[21:59:34] T355854: Add some namespace alias for gan.wikipedia - https://phabricator.wikimedia.org/T355854
[21:59:36] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/994254 (https://phabricator.wikimedia.org/T355033) (owner: Superpes15)
[22:00:07] (PS2) Dzahn: add google-site-verification to gitlab.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776)
[22:00:14] (CR) Dzahn: "comment added" [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) (owner: Dzahn)
[22:00:17] (Merged) jenkins-bot: [eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl [mediawiki-config] - https://gerrit.wikimedia.org/r/994254 (https://phabricator.wikimedia.org/T355033) (owner: Superpes15)
[22:00:43] !log cjming@deploy2002 Started scap: Backport for [[gerrit:994254|[eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl (T355033)]]
[22:00:49] T355033: Disable magic word INDEX sitewide on eswiki - https://phabricator.wikimedia.org/T355033
[22:02:15] !log cjming@deploy2002 cjming and superpes: Backport for [[gerrit:994254|[eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl (T355033)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:02:17] Superpes: shall i sync your last patch?
[22:02:41] (CR) Dzahn: [C: +1] "looks good to me. it's like what google told me to do for gitlab.wikimedia.org for postmaster tools" [dns] - https://gerrit.wikimedia.org/r/994331 (https://phabricator.wikimedia.org/T356221) (owner: JHathaway)
[22:02:43] Yep thanks cjming
[22:02:45] :)
[22:02:48] !log cjming@deploy2002 cjming and superpes: Continuing with sync
[22:03:58] (CR) JHathaway: [C: +1] add google-site-verification to gitlab.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) (owner: Dzahn)
[22:04:09] (CR) JHathaway: [C: +2] donate: verify domain for Google's postmaster tools [dns] - https://gerrit.wikimedia.org/r/994331 (https://phabricator.wikimedia.org/T356221) (owner: JHathaway)
[22:04:19] (PS3) Dzahn: add google-site-verification to gitlab.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776)
[22:05:38] (CR) Dzahn: [C: +2] add google-site-verification to gitlab.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776) (owner: Dzahn)
[22:05:45] (PS4) Dzahn: add google-site-verification to gitlab.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/994314 (https://phabricator.wikimedia.org/T355776)
[22:07:49] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Papaul) @Marostegui if those hosts have a 10G NIC you don't have a problem for those going into row A and B to connect them to a 10G interface?
[22:09:07] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:994254|[eswiki] Add 13 namespaces to $wgExemptFromUserRobotsControl (T355033)]] (duration: 08m 24s)
[22:09:13] T355033: Disable magic word INDEX sitewide on eswiki - https://phabricator.wikimedia.org/T355033
[22:09:31] Superpes: all done/live - and i ran namespacedupes
[22:09:32] (CR) Paladox: [C: +1] gerrit: sync soy email template with version 3.7 [puppet] - https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: Hashar)
[22:09:33] Oh Wonderful
[22:09:39] Thanks cjming for your assistance :3
[22:09:53] np! thanks for your patience
[22:10:22] !log end of UTC late backport window
[22:10:22] (CR) Paladox: [C: +1] gerrit: sync soy email template with version 3.7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: Hashar)
[22:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:03] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for sessionstore1005.eqiad.wmnet
[22:20:04] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sessionstore1005.eqiad.wmnet
[22:25:22] (PS1) JHathaway: Add DKIM & SPF records for wikimediafoundation.myshopify.com [dns] - https://gerrit.wikimedia.org/r/994333 (https://phabricator.wikimedia.org/T355833)
[22:25:39] SRE, LDAP-Access-Requests, LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (thcipriani) >>! In T356043#9497712, @eoghan wrote: > @thcipriani Can you give a quick approval for @Aklapper, please? Yep, approved!
[22:26:14] (PS2) JHathaway: Add DKIM & SPF records for wikimediafoundation.myshopify.com [dns] - https://gerrit.wikimedia.org/r/994333 (https://phabricator.wikimedia.org/T355833)
[22:35:06] (PS3) Cwhite: logging::collector: add mw accesslog sampling by benthos [puppet] - https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836)
[22:35:08] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/993476/1249/" [puppet] - https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) (owner: Cwhite)
[22:38:58] (PS1) Bking: cloudelastic: use acme-chief/letsencrypt with canary [puppet] - https://gerrit.wikimedia.org/r/994338 (https://phabricator.wikimedia.org/T355617)
[22:40:14] (CR) Eevans: [C: +2] sessionstore: provision sessionstore1006 (new) [puppet] - https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402) (owner: Eevans)
[22:40:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: activate first private IP host config - bking@cumin2002 - T355617
[22:41:08] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:41:11] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/994338 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:41:21] (PS4) Eevans: sessionstore: provision sessionstore1006 (new) [puppet] - https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402)
[22:41:23] (PS4) Eevans: sessionstore: configure new hosts to reuse /srv [puppet] - https://gerrit.wikimedia.org/r/989631 (https://phabricator.wikimedia.org/T353402)
[22:48:54] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on sessionstore1006.eqiad.wmnet with reason: Bootstrapping — T353402
[22:48:59] T353402: Provision new sessionstore (eqiad) cluster nodes: sessionstore[4-6] - https://phabricator.wikimedia.org/T353402
[22:49:08] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on sessionstore1006.eqiad.wmnet with reason: Bootstrapping — T353402
[22:50:02] (CR) JHathaway: "kindly review" [dns] - https://gerrit.wikimedia.org/r/994333 (https://phabricator.wikimedia.org/T355833) (owner: JHathaway)
[22:50:32] SRE, DNS, Foundational Technology Requests, Traffic, Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (jhathaway) @bcampbell patch is out for review
[22:52:28] (PS1) EoghanGaffney: [vrts] Switch from RuntimeDB to StaticDB for queue indexes [puppet] - https://gerrit.wikimedia.org/r/994341 (https://phabricator.wikimedia.org/T355979)
[22:57:42] (CR) Ryan Kemper: [C: +1] cloudelastic: use acme-chief/letsencrypt with canary [puppet] - https://gerrit.wikimedia.org/r/994338 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:59:30] (PS1) Scott French: icinga: add swfrench to authorized user lists [puppet] - https://gerrit.wikimedia.org/r/994342
[22:59:33] (CR) Eevans: [C: +2] sessionstore: configure new hosts to reuse /srv [puppet] - https://gerrit.wikimedia.org/r/989631 (https://phabricator.wikimedia.org/T353402) (owner: Eevans)
[22:59:45] (CR) EoghanGaffney: "ar" [puppet] - https://gerrit.wikimedia.org/r/994341 (https://phabricator.wikimedia.org/T355979) (owner: EoghanGaffney)
[23:02:50] (PS2) Scott French: icinga: add swfrench to authorized user lists [puppet] - https://gerrit.wikimedia.org/r/994342
[23:03:52] (CR) Scott French: "Thanks in advance for the review, Reuven." [puppet] - https://gerrit.wikimedia.org/r/994342 (owner: Scott French)
[23:06:17] (PS1) Eevans: (faux) keys & certs for new sessionstore hosts [labs/private] - https://gerrit.wikimedia.org/r/994347 (https://phabricator.wikimedia.org/T353402)
[23:07:12] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for sessionstore1006.eqiad.wmnet
[23:07:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sessionstore1006.eqiad.wmnet
[23:07:34] (CR) Eevans: [V: +2 C: +2] (faux) keys & certs for new sessionstore hosts [labs/private] - https://gerrit.wikimedia.org/r/994347 (https://phabricator.wikimedia.org/T353402) (owner: Eevans)
[23:09:31] (Abandoned) Jdlrobson: Update checkboxHack target node [skins/MinervaNeue] (wmf/1.42.0-wmf.13) - https://gerrit.wikimedia.org/r/991050 (https://phabricator.wikimedia.org/T354315) (owner: Jdlrobson)
[23:09:43] (PS3) Jdlrobson: Enable desktop diff HTML on mobile pages for all logged in users [mediawiki-config] - https://gerrit.wikimedia.org/r/994224 (https://phabricator.wikimedia.org/T350181)
[23:14:10] (CR) RLazarus: [C: +1] icinga: add swfrench to authorized user lists [puppet] - https://gerrit.wikimedia.org/r/994342 (owner: Scott French)
[23:16:13] (CR) Scott French: [C: +2] icinga: add swfrench to authorized user lists [puppet] - https://gerrit.wikimedia.org/r/994342 (owner: Scott French)
[23:45:32] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Jhancock.wm)
[23:45:54] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Jhancock.wm) a:Jhancock.wm
[23:54:47] !log LDAP - added aklapper to group releng T356043
[23:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:53] T356043: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043
[23:55:09] SRE, LDAP-Access-Requests, LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (Dzahn)
[23:55:16] SRE, LDAP-Access-Requests, LDAP: Missing Release Engineering members in LDAP group - https://phabricator.wikimedia.org/T356043 (Dzahn) Open→Resolved