[00:00:40] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:47] jouncebot: nowandnext [00:00:48] No deployments scheduled for the next 5 hour(s) and 59 minute(s) [00:00:48] In 5 hour(s) and 59 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T0600) [00:01:30] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:58] !log Deploying security patch for T338276 [00:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T336886)', diff saved to https://phabricator.wikimedia.org/P48962 and previous config saved to /var/cache/conftool/dbconfig/20230607-000316-ladsgroup.json [00:03:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [00:03:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:03:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance [00:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P48963 and previous config saved to /var/cache/conftool/dbconfig/20230607-000337-ladsgroup.json [00:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P48964 and previous config saved to /var/cache/conftool/dbconfig/20230607-000637-ladsgroup.json [00:06:54] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The 
following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T336886)', diff saved to https://phabricator.wikimedia.org/P48965 and previous config saved to /var/cache/conftool/dbconfig/20230607-000754-ladsgroup.json [00:07:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [00:08:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [00:08:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T336886)', diff saved to https://phabricator.wikimedia.org/P48966 and previous config saved to /var/cache/conftool/dbconfig/20230607-000814-ladsgroup.json [00:08:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:08:37] !log urbanecm: Deployed security patch for T338276 [00:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T336886)', diff saved to https://phabricator.wikimedia.org/P48967 and previous config saved to /var/cache/conftool/dbconfig/20230607-001136-ladsgroup.json [00:11:40] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:13:04] PROBLEM - PHP opcache health on mw1494 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:14:56] !log urbanecm: Deployed security patch for T338276 [00:15:01] * urbanecm done [00:21:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:21:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P48968 and previous config saved to /var/cache/conftool/dbconfig/20230607-002143-ladsgroup.json [00:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P48969 and previous config saved to /var/cache/conftool/dbconfig/20230607-002642-ladsgroup.json [00:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P48970 and previous config saved to /var/cache/conftool/dbconfig/20230607-003649-ladsgroup.json [00:39:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/927773 [00:39:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/927773 (owner: TrainBranchBot) [00:41:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P48971 and previous config saved to /var/cache/conftool/dbconfig/20230607-004148-ladsgroup.json [00:47:24] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:05] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:51:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P48972 and previous config saved to /var/cache/conftool/dbconfig/20230607-005155-ladsgroup.json [00:51:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:51:59] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:52:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [00:54:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [00:54:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [00:56:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [00:56:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T336886)', diff saved to https://phabricator.wikimedia.org/P48973 and previous config saved to /var/cache/conftool/dbconfig/20230607-005654-ladsgroup.json [00:56:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [00:57:03] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 
[00:57:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [00:57:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [00:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T336886)', diff saved to https://phabricator.wikimedia.org/P48974 and previous config saved to /var/cache/conftool/dbconfig/20230607-005713-ladsgroup.json [00:57:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T336886)', diff saved to https://phabricator.wikimedia.org/P48975 and previous config saved to /var/cache/conftool/dbconfig/20230607-005722-ladsgroup.json [00:57:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/927773 (owner: TrainBranchBot) [00:58:15] SRE, LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (KFrancis) The agreement has been sent for signatures. I'll update when it's complete. Thanks! [01:00:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T336886)', diff saved to https://phabricator.wikimedia.org/P48976 and previous config saved to /var/cache/conftool/dbconfig/20230607-010047-ladsgroup.json [01:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T336886)', diff saved to https://phabricator.wikimedia.org/P48977 and previous config saved to /var/cache/conftool/dbconfig/20230607-010055-ladsgroup.json [01:01:22] PROBLEM - PHP opcache health on mw1445 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:03:44] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:16] RECOVERY - PHP opcache health on mw1461 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:11:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:12:54] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P48978 and previous config saved to /var/cache/conftool/dbconfig/20230607-011553-ladsgroup.json [01:16:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P48979 and previous config saved to /var/cache/conftool/dbconfig/20230607-011602-ladsgroup.json [01:28:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P48980 and previous config saved to /var/cache/conftool/dbconfig/20230607-013059-ladsgroup.json [01:31:08] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P48981 and previous config saved to /var/cache/conftool/dbconfig/20230607-013108-ladsgroup.json [01:35:24] (CR) Ssingh: [C: +1] add app.dev.learn.wiki pointing to AWS [dns] - https://gerrit.wikimedia.org/r/927798 (https://phabricator.wikimedia.org/T338280) (owner: Dzahn) [01:38:05] (CR) Ssingh: [C: +1] "Thanks for the patch dzahn!" [dns] - https://gerrit.wikimedia.org/r/927798 (https://phabricator.wikimedia.org/T338280) (owner: Dzahn) [01:38:07] (CR) Ssingh: [C: +2] add app.dev.learn.wiki pointing to AWS [dns] - https://gerrit.wikimedia.org/r/927798 (https://phabricator.wikimedia.org/T338280) (owner: Dzahn) [01:39:01] !log run authdns-update: T338280 [01:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:04] T338280: Additional DNS entry for WikiLearn - https://phabricator.wikimedia.org/T338280 [01:41:01] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Papaul) @Jclark-ctr ` spicerack.remote.RemoteCheckError: Reboot for dbproxy1022.eqiad.wmnet not found yet, keep polling for it: unable to get uptime `` when you... [01:42:42] SRE, DNS, Traffic, Patch-For-Review: Additional DNS entry for WikiLearn - https://phabricator.wikimedia.org/T338280 (ssingh) ` $ dig app.dev.learn.wiki +short 52.44.207.59 ` Thanks to @Dzahn for the patch! 
[01:42:49] SRE, DNS, Traffic, Patch-For-Review: Additional DNS entry for WikiLearn - https://phabricator.wikimedia.org/T338280 (ssingh) Open→Resolved a:ssingh [01:46:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T336886)', diff saved to https://phabricator.wikimedia.org/P48982 and previous config saved to /var/cache/conftool/dbconfig/20230607-014605-ladsgroup.json [01:46:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [01:46:09] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [01:46:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T336886)', diff saved to https://phabricator.wikimedia.org/P48983 and previous config saved to /var/cache/conftool/dbconfig/20230607-014614-ladsgroup.json [01:46:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [01:46:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [01:46:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T336886)', diff saved to https://phabricator.wikimedia.org/P48984 and previous config saved to /var/cache/conftool/dbconfig/20230607-014626-ladsgroup.json [01:46:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [01:46:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T336886)', diff saved to https://phabricator.wikimedia.org/P48985 and previous config saved to /var/cache/conftool/dbconfig/20230607-014635-ladsgroup.json [01:50:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T336886)', diff saved to https://phabricator.wikimedia.org/P48986 and 
previous config saved to /var/cache/conftool/dbconfig/20230607-015012-ladsgroup.json [01:50:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T336886)', diff saved to https://phabricator.wikimedia.org/P48987 and previous config saved to /var/cache/conftool/dbconfig/20230607-015043-ladsgroup.json [02:00:48] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P48988 and previous config saved to /var/cache/conftool/dbconfig/20230607-020518-ladsgroup.json [02:05:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P48989 and previous config saved to /var/cache/conftool/dbconfig/20230607-020550-ladsgroup.json [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P48990 and previous config saved to /var/cache/conftool/dbconfig/20230607-022031-ladsgroup.json [02:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P48991 and previous config saved to 
/var/cache/conftool/dbconfig/20230607-022057-ladsgroup.json [02:26:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:18] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T336886)', diff saved to https://phabricator.wikimedia.org/P48992 and previous config saved to /var/cache/conftool/dbconfig/20230607-023537-ladsgroup.json [02:35:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [02:35:41] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [02:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [02:35:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [02:36:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T336886)', diff saved to https://phabricator.wikimedia.org/P48993 and previous config saved to /var/cache/conftool/dbconfig/20230607-023603-ladsgroup.json [02:36:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance 
[02:36:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [02:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T336886)', diff saved to https://phabricator.wikimedia.org/P48994 and previous config saved to /var/cache/conftool/dbconfig/20230607-023613-ladsgroup.json [02:36:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [02:36:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T336886)', diff saved to https://phabricator.wikimedia.org/P48995 and previous config saved to /var/cache/conftool/dbconfig/20230607-023624-ladsgroup.json [02:36:50] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:38:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T336886)', diff saved to https://phabricator.wikimedia.org/P48996 and previous config saved to /var/cache/conftool/dbconfig/20230607-023848-ladsgroup.json [02:39:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T336886)', diff saved to https://phabricator.wikimedia.org/P48997 and previous config saved to /var/cache/conftool/dbconfig/20230607-023943-ladsgroup.json [02:51:40] RECOVERY - PHP opcache health on mw1445 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [02:53:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P48998 and previous config saved to /var/cache/conftool/dbconfig/20230607-025355-ladsgroup.json [02:54:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to 
https://phabricator.wikimedia.org/P48999 and previous config saved to /var/cache/conftool/dbconfig/20230607-025449-ladsgroup.json [03:09:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P49000 and previous config saved to /var/cache/conftool/dbconfig/20230607-030901-ladsgroup.json [03:09:09] (PS1) Jameel Kaisar: Increase NetworkProbeLimit 10x [puppet] - https://gerrit.wikimedia.org/r/927809 (https://phabricator.wikimedia.org/T335637) [03:09:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P49001 and previous config saved to /var/cache/conftool/dbconfig/20230607-030955-ladsgroup.json [03:10:36] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/927809 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar) [03:11:53] (PS3) KartikMistry: Use direct Parsoid in Small and Medium Wikis for Content Translation [mediawiki-config] - https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922) [03:13:23] (PS2) Jameel Kaisar: Increase NetworkProbeLimit 10x [puppet] - https://gerrit.wikimedia.org/r/927809 (https://phabricator.wikimedia.org/T335637) [03:14:21] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/927809 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar) [03:16:32] (CR) Jameel Kaisar: [C: +1] Increase NetworkProbeLimit 10x [puppet] - https://gerrit.wikimedia.org/r/927809 (https://phabricator.wikimedia.org/T335637) (owner: Jameel Kaisar) [03:24:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T336886)', diff saved to https://phabricator.wikimedia.org/P49002 and previous config saved to /var/cache/conftool/dbconfig/20230607-032407-ladsgroup.json [03:24:09] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [03:24:11] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [03:24:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [03:24:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P49003 and previous config saved to /var/cache/conftool/dbconfig/20230607-032428-ladsgroup.json [03:25:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T336886)', diff saved to https://phabricator.wikimedia.org/P49004 and previous config saved to /var/cache/conftool/dbconfig/20230607-032501-ladsgroup.json [03:25:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [03:25:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [03:25:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1214 (T336886)', diff saved to https://phabricator.wikimedia.org/P49005 and previous config saved to /var/cache/conftool/dbconfig/20230607-032522-ladsgroup.json [03:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P49006 and previous config saved to /var/cache/conftool/dbconfig/20230607-032808-ladsgroup.json [03:28:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T336886)', diff saved to https://phabricator.wikimedia.org/P49007 and previous config saved to /var/cache/conftool/dbconfig/20230607-032839-ladsgroup.json [03:37:34] RECOVERY - Check systemd state on vrts2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P49008 and previous config saved to /var/cache/conftool/dbconfig/20230607-034314-ladsgroup.json [03:43:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P49009 and previous config saved to /var/cache/conftool/dbconfig/20230607-034345-ladsgroup.json [03:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P49010 and previous config saved to /var/cache/conftool/dbconfig/20230607-035820-ladsgroup.json [03:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P49011 and previous config saved to /var/cache/conftool/dbconfig/20230607-035851-ladsgroup.json [04:00:30] RECOVERY - PHP opcache health on mw1467 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [04:06:11] (PS4) KartikMistry: Update MinT to 2023-06-06-120533-production [deployment-charts] - https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337910) [04:13:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P49012 and previous config saved to /var/cache/conftool/dbconfig/20230607-041326-ladsgroup.json [04:13:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [04:13:31] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [04:13:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
12:00:00 on db2157.codfw.wmnet with reason: Maintenance [04:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T336886)', diff saved to https://phabricator.wikimedia.org/P49013 and previous config saved to /var/cache/conftool/dbconfig/20230607-041347-ladsgroup.json [04:13:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T336886)', diff saved to https://phabricator.wikimedia.org/P49014 and previous config saved to /var/cache/conftool/dbconfig/20230607-041357-ladsgroup.json [04:13:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [04:14:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [04:15:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:15:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:17:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [04:17:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [04:17:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T336886)', diff saved to https://phabricator.wikimedia.org/P49015 and previous config saved to /var/cache/conftool/dbconfig/20230607-041719-ladsgroup.json [04:18:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [04:18:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [04:20:21] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [04:20:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [04:20:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T336886)', diff saved to https://phabricator.wikimedia.org/P49016 and previous config saved to /var/cache/conftool/dbconfig/20230607-042040-ladsgroup.json [04:20:43] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [04:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T336886)', diff saved to https://phabricator.wikimedia.org/P49017 and previous config saved to /var/cache/conftool/dbconfig/20230607-042304-ladsgroup.json [04:29:05] * kart_ updating MinT [04:29:23] (CR) KartikMistry: [C: +2] Update MinT to 2023-06-06-120533-production [deployment-charts] - https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337910) (owner: KartikMistry) [04:30:11] (Merged) jenkins-bot: Update MinT to 2023-06-06-120533-production [deployment-charts] - https://gerrit.wikimedia.org/r/927160 (https://phabricator.wikimedia.org/T337910) (owner: KartikMistry) [04:31:14] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [04:32:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P49018 and previous config saved to /var/cache/conftool/dbconfig/20230607-043225-ladsgroup.json [04:32:34] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [04:36:42] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [04:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to 
https://phabricator.wikimedia.org/P49019 and previous config saved to /var/cache/conftool/dbconfig/20230607-043810-ladsgroup.json [04:39:34] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [04:45:41] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [04:47:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P49020 and previous config saved to /var/cache/conftool/dbconfig/20230607-044731-ladsgroup.json [04:48:55] (PS1) KartikMistry: Update cxserver to 2023-06-07-044025-production [deployment-charts] - https://gerrit.wikimedia.org/r/927812 (https://phabricator.wikimedia.org/T337290) [04:51:41] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [04:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P49021 and previous config saved to /var/cache/conftool/dbconfig/20230607-045317-ladsgroup.json [05:02:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T336886)', diff saved to https://phabricator.wikimedia.org/P49022 and previous config saved to /var/cache/conftool/dbconfig/20230607-050237-ladsgroup.json [05:02:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [05:02:41] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [05:02:51] !log Updated MinT to 2023-06-06-120533-production (T337910, T337686, T337708) [05:02:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [05:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:56] T337910: Adjust Norwegian mapping for MinT 
configuration - https://phabricator.wikimedia.org/T337910 [05:02:56] T337708: MinT translates en dash to ?? - https://phabricator.wikimedia.org/T337708 [05:02:56] T337686: Issues with apostrophes when translating with MinT - https://phabricator.wikimedia.org/T337686 [05:02:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P49023 and previous config saved to /var/cache/conftool/dbconfig/20230607-050258-ladsgroup.json [05:07:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P49024 and previous config saved to /var/cache/conftool/dbconfig/20230607-050740-ladsgroup.json [05:07:44] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [05:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T336886)', diff saved to https://phabricator.wikimedia.org/P49025 and previous config saved to /var/cache/conftool/dbconfig/20230607-050823-ladsgroup.json [05:08:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [05:08:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [05:08:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T336886)', diff saved to https://phabricator.wikimedia.org/P49026 and previous config saved to /var/cache/conftool/dbconfig/20230607-050844-ladsgroup.json [05:12:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T336886)', diff saved to https://phabricator.wikimedia.org/P49027 and previous config saved to /var/cache/conftool/dbconfig/20230607-051207-ladsgroup.json [05:12:21] And, now updating cxserver.. 
[05:12:44] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-06-07-044025-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927812 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [05:13:34] (03Merged) 10jenkins-bot: Update cxserver to 2023-06-07-044025-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/927812 (https://phabricator.wikimedia.org/T337290) (owner: 10KartikMistry) [05:17:01] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:17:19] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:22:22] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:22:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P49028 and previous config saved to /var/cache/conftool/dbconfig/20230607-052247-ladsgroup.json [05:22:57] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:25:10] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:25:44] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:27:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P49029 and previous config saved to /var/cache/conftool/dbconfig/20230607-052713-ladsgroup.json [05:28:33] !log Updated cxserver to 2023-06-07-044025-production (T337290, T337669, T337834) [05:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:38] T337834: Enable MinT, Content and Section Translation for a 3rd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337834 [05:28:39] T337290: Enable MinT, Content and Section Translation for 10 languages previously lacking machine translation - 
https://phabricator.wikimedia.org/T337290 [05:28:39] T337669: Enable MinT, Content and Section Translation for a 2nd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337669 [05:37:33] (03CR) 10Ayounsi: [C: 03+2] Add /.vscode/ to .gitignore [cookbooks] - 10https://gerrit.wikimedia.org/r/926493 (owner: 10Ayounsi) [05:37:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P49030 and previous config saved to /var/cache/conftool/dbconfig/20230607-053753-ladsgroup.json [05:40:06] (03Merged) 10jenkins-bot: Add /.vscode/ to .gitignore [cookbooks] - 10https://gerrit.wikimedia.org/r/926493 (owner: 10Ayounsi) [05:42:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P49031 and previous config saved to /var/cache/conftool/dbconfig/20230607-054220-ladsgroup.json [05:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T336886)', diff saved to https://phabricator.wikimedia.org/P49032 and previous config saved to /var/cache/conftool/dbconfig/20230607-055259-ladsgroup.json [05:53:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [05:53:03] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [05:53:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [05:53:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T336886)', diff saved to https://phabricator.wikimedia.org/P49033 and previous config saved to /var/cache/conftool/dbconfig/20230607-055320-ladsgroup.json [05:53:38] (03CR) 10Santhosh: [C: 03+1] Use direct Parsoid in Small and Medium Wikis for Content Translation 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922) (owner: 10KartikMistry) [05:55:42] PROBLEM - PHP opcache health on mw1439 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T336886)', diff saved to https://phabricator.wikimedia.org/P49034 and previous config saved to /var/cache/conftool/dbconfig/20230607-055655-ladsgroup.json [05:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T336886)', diff saved to https://phabricator.wikimedia.org/P49035 and previous config saved to /var/cache/conftool/dbconfig/20230607-055726-ladsgroup.json [05:57:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [05:57:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [05:57:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T336886)', diff saved to https://phabricator.wikimedia.org/P49036 and previous config saved to /var/cache/conftool/dbconfig/20230607-055746-ladsgroup.json [06:00:07] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T0600) [06:01:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T336886)', diff saved to https://phabricator.wikimedia.org/P49037 and previous config saved to /var/cache/conftool/dbconfig/20230607-060112-ladsgroup.json [06:01:16] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [06:07:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the 
systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:08:00] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P49038 and previous config saved to /var/cache/conftool/dbconfig/20230607-061203-ladsgroup.json [06:16:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P49039 and previous config saved to /var/cache/conftool/dbconfig/20230607-061618-ladsgroup.json [06:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P49040 and previous config saved to /var/cache/conftool/dbconfig/20230607-062709-ladsgroup.json [06:31:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P49041 and previous config saved to /var/cache/conftool/dbconfig/20230607-063125-ladsgroup.json [06:42:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T336886)', diff saved to https://phabricator.wikimedia.org/P49042 and previous config saved to /var/cache/conftool/dbconfig/20230607-064215-ladsgroup.json [06:42:19] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [06:42:56] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:25] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - 
https://phabricator.wikimedia.org/T337766 (10KCVelaga_WMF) Thank you @cmooney and @mpopov! I got an error(?) when I ran the command Mikhail shared ` sudo: effective uid is not 0, is /usr/bin... [06:45:36] (03PS1) 10Slyngshede: Signup: Add email validator for signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/927968 [06:46:21] (03Abandoned) 10Slyngshede: C:IDM Allow Bitu library to write to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/927672 (owner: 10Slyngshede) [06:46:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T336886)', diff saved to https://phabricator.wikimedia.org/P49043 and previous config saved to /var/cache/conftool/dbconfig/20230607-064631-ladsgroup.json [06:46:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [06:46:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [06:46:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T336886)', diff saved to https://phabricator.wikimedia.org/P49044 and previous config saved to /var/cache/conftool/dbconfig/20230607-064652-ladsgroup.json [06:47:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This only fixes one specific case - the error handling." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [06:50:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T336886)', diff saved to https://phabricator.wikimedia.org/P49045 and previous config saved to /var/cache/conftool/dbconfig/20230607-065015-ladsgroup.json [06:50:18] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:00:04] Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:21] 0/ [07:01:57] I'll go ahead with deployment. [07:03:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922) (owner: 10KartikMistry) [07:03:50] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:04:01] (03Merged) 10jenkins-bot: Use direct Parsoid in Small and Medium Wikis for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925742 (https://phabricator.wikimedia.org/T337922) (owner: 10KartikMistry) [07:04:45] !log kartik@deploy1002 Started scap: Backport for [[gerrit:925742|Use direct Parsoid in Small and Medium Wikis for Content Translation (T337922)]] [07:04:49] T337922: Use Parsoid in Small and Medium Wikis for Content Translation - https://phabricator.wikimedia.org/T337922 [07:05:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P49046 and previous config saved to /var/cache/conftool/dbconfig/20230607-070521-ladsgroup.json [07:06:12] !log kartik@deploy1002 kartik: Backport for [[gerrit:925742|Use direct Parsoid in Small and Medium Wikis for Content Translation (T337922)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [07:11:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:15:19] I'm still testing my patch on mwdebugs* [07:20:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P49047 and previous config saved to /var/cache/conftool/dbconfig/20230607-072027-ladsgroup.json [07:22:52] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:925742|Use direct Parsoid in Small and Medium Wikis for Content Translation (T337922)]] (duration: 18m 06s) [07:22:56] T337922: Use Parsoid in Small and Medium Wikis for Content Translation - https://phabricator.wikimedia.org/T337922 [07:23:06] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: Fix /excimer/ POST restriction [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [07:25:01] (03CR) 10Filippo Giunchedi: [C: 03+1] opensearch: disable security plugin on codfw [puppet] - 10https://gerrit.wikimedia.org/r/927771 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [07:25:24] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: bump logstash kafka lag threshold [alerts] - 10https://gerrit.wikimedia.org/r/927626 (owner: 10Filippo Giunchedi) [07:26:51] (03CR) 10Filippo Giunchedi: [C: 03+1] opensearch: clean up hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/927769 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [07:27:23] (03CR) 10Filippo Giunchedi: "Very nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/925120 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [07:27:29] (03CR) 10Filippo Giunchedi: [C: 03+1] lvs: remove lvs::monitor_services [puppet] - 10https://gerrit.wikimedia.org/r/925120 (https://phabricator.wikimedia.org/T320620) (owner: 10Cwhite) [07:28:21] I'm done with deployment (Forgot to update) [07:30:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927745 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [07:31:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927746 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [07:31:49] (03PS1) 10Slyngshede: C:idm:deployment restart services on reconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/927969 [07:32:11] (03CR) 10CI reject: [V: 04-1] C:idm:deployment restart services on reconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/927969 (owner: 10Slyngshede) [07:32:46] (03PS2) 10Slyngshede: C:idm:deployment restart services on reconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/927969 [07:35:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T336886)', diff saved to https://phabricator.wikimedia.org/P49048 and previous config saved to /var/cache/conftool/dbconfig/20230607-073533-ladsgroup.json [07:35:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [07:35:37] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:35:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [07:35:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T336886)', diff saved to https://phabricator.wikimedia.org/P49049 and previous config saved to 
/var/cache/conftool/dbconfig/20230607-073554-ladsgroup.json [07:37:06] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] profile: exclude kubelet production hosts from cadvisor rollout [puppet] - 10https://gerrit.wikimedia.org/r/927198 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [07:38:59] (03PS1) 10Jcrespo: icinga: Remove references to andy before removing the icinga contact [puppet] - 10https://gerrit.wikimedia.org/r/927970 [07:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T336886)', diff saved to https://phabricator.wikimedia.org/P49050 and previous config saved to /var/cache/conftool/dbconfig/20230607-073916-ladsgroup.json [07:41:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/927970 (owner: 10Jcrespo) [07:42:00] (03CR) 10Jcrespo: [C: 03+2] icinga: Remove references to andy before removing the icinga contact [puppet] - 10https://gerrit.wikimedia.org/r/927970 (owner: 10Jcrespo) [07:42:37] (03PS1) 10David Caro: wmcs.instance: pin ruby2.5 [puppet] - 10https://gerrit.wikimedia.org/r/927971 (https://phabricator.wikimedia.org/T338294) [07:42:48] (03PS1) 10Filippo Giunchedi: base: bump cadvisor rollout to 20% in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/927972 (https://phabricator.wikimedia.org/T108027) [07:43:02] (03CR) 10CI reject: [V: 04-1] wmcs.instance: pin ruby2.5 [puppet] - 10https://gerrit.wikimedia.org/r/927971 (https://phabricator.wikimedia.org/T338294) (owner: 10David Caro) [07:45:18] (03PS2) 10David Caro: wmcs.instance: pin ruby2.5 [puppet] - 10https://gerrit.wikimedia.org/r/927971 (https://phabricator.wikimedia.org/T338294) [07:51:03] (03PS3) 10David Caro: wmcs.instance: pin ruby2.5 [puppet] - 10https://gerrit.wikimedia.org/r/927971 (https://phabricator.wikimedia.org/T338294) [07:54:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P49051 
and previous config saved to /var/cache/conftool/dbconfig/20230607-075422-ladsgroup.json [07:56:11] (03CR) 10Filippo Giunchedi: [C: 03+2] base: bump cadvisor rollout to 20% in eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/927972 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:00:00] (03PS1) 10Slyngshede: Blocklists: Fix error in regex reader. [software/bitu] - 10https://gerrit.wikimedia.org/r/927974 [08:00:42] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Blocklists: Fix error in regex reader. [software/bitu] - 10https://gerrit.wikimedia.org/r/927974 (owner: 10Slyngshede) [08:07:29] (03CR) 10Jelto: [C: 03+2] gitlab: move gitlab to test idp [puppet] - 10https://gerrit.wikimedia.org/r/927602 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [08:09:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P49052 and previous config saved to /var/cache/conftool/dbconfig/20230607-080928-ladsgroup.json [08:11:07] (03PS1) 10Elukey: analytics refinery: add a data purge job for webrequest_sampled_live [puppet] - 10https://gerrit.wikimedia.org/r/927976 (https://phabricator.wikimedia.org/T337460) [08:11:31] (03CR) 10Muehlenhoff: C:idm switch to read/write user for LDAP access. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927752 (owner: 10Slyngshede) [08:13:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41594/console" [puppet] - 10https://gerrit.wikimedia.org/r/927976 (https://phabricator.wikimedia.org/T337460) (owner: 10Elukey) [08:19:02] (03PS3) 10Giuseppe Lavagetto: poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [08:19:04] (03PS1) 10Giuseppe Lavagetto: Poolcounter.release: don't reconnect if the stream is lost [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927978 [08:19:06] (03PS1) 10Giuseppe Lavagetto: Also add Poolcounter.release() to on_finish [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927979 [08:21:20] (03PS1) 10Hashar: zuul: remove mode/umask from config git clone [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) [08:21:39] 10Puppet, 10Release-Engineering-Team, 10Patch-For-Review: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277 (10hashar) a:03hashar [08:22:01] (03CR) 10Vgutierrez: [C: 03+1] "LGTM; please stop puppet first in A:cp-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/927715 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:22:27] !log uploaded ruby 2.5.5-3+deb10u5+wmf1 to apt.wikimedia.org, unbreaking Puppet runs after latest Ruby update for Buster T338294 [08:22:28] (03CR) 10Fabfur: hiera: Swap port 80 from varnish to haproxy in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927715 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:31] T338294: ruby2.5 2.5.5-3+deb10u5 breaks Puppet - 
https://phabricator.wikimedia.org/T338294 [08:24:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T336886)', diff saved to https://phabricator.wikimedia.org/P49053 and previous config saved to /var/cache/conftool/dbconfig/20230607-082434-ladsgroup.json [08:24:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [08:24:38] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [08:24:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [08:24:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:24:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:25:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T336886)', diff saved to https://phabricator.wikimedia.org/P49054 and previous config saved to /var/cache/conftool/dbconfig/20230607-082500-ladsgroup.json [08:27:07] (03CR) 10Hashar: "I have found that one when investigating why we can't build dev-images docker-pkg images (due to git safe.directory)." 
[puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T336886)', diff saved to https://phabricator.wikimedia.org/P49055 and previous config saved to /var/cache/conftool/dbconfig/20230607-082823-ladsgroup.json [08:29:33] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir [08:34:14] !log disable puppet on A:cp-eqiad for varnish <-> haproxy port 80 swap [08:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:31] RECOVERY - PHP opcache health on mw1494 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:34:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Removing my -1 as I added release to one more place. I still feel the whole system is deeply flawed here." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [08:38:59] (03PS1) 10Giuseppe Lavagetto: thumbor: allow changing poolcounter's release timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/927981 (https://phabricator.wikimedia.org/T337649) [08:41:04] (03PS3) 10Hashar: contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) [08:42:34] (03PS4) 10Giuseppe Lavagetto: poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [08:42:36] (03PS2) 10Giuseppe Lavagetto: Poolcounter.release: don't reconnect if the stream is lost [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927978 (https://phabricator.wikimedia.org/T337649) [08:42:38] (03PS2) 10Giuseppe Lavagetto: Also add Poolcounter.release() to 
on_finish [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927979 (https://phabricator.wikimedia.org/T337649) [08:43:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P49056 and previous config saved to /var/cache/conftool/dbconfig/20230607-084329-ladsgroup.json [08:44:33] (03CR) 10Hashar: "That follows my comment on parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/927980/" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:47:24] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: allow changing poolcounter's release timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/927981 (https://phabricator.wikimedia.org/T337649) (owner: 10Giuseppe Lavagetto) [08:47:55] (03CR) 10Jbond: apt::repository: remove conflicting .list files from bookworm /etc/apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [08:48:16] (03Merged) 10jenkins-bot: thumbor: allow changing poolcounter's release timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/927981 (https://phabricator.wikimedia.org/T337649) (owner: 10Giuseppe Lavagetto) [08:49:15] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [08:49:19] (03CR) 10Jbond: [C: 03+1] C:idm:deployment restart services on reconfiguration [puppet] - 10https://gerrit.wikimedia.org/r/927969 (owner: 10Slyngshede) [08:50:03] (03CR) 10Fabfur: [C: 03+2] hiera: Swap port 80 from varnish to haproxy in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/927715 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [08:52:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/927969 (owner: 10Slyngshede) [08:54:43] (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment restart services on reconfiguration [puppet] - 
10https://gerrit.wikimedia.org/r/927969 (owner: 10Slyngshede) [08:56:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/927745 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [08:56:51] (03PS2) 10Stevemunene: Decommission analytics1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/927667 (https://phabricator.wikimedia.org/T338227) [08:58:02] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/927667 (https://phabricator.wikimedia.org/T338227) (owner: 10Stevemunene) [08:58:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P49057 and previous config saved to /var/cache/conftool/dbconfig/20230607-085835-ladsgroup.json [08:58:55] (03CR) 10Stevemunene: [C: 03+2] Decommission analytics1058 from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/927667 (https://phabricator.wikimedia.org/T338227) (owner: 10Stevemunene) [08:59:21] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-text_eqiad and A:cp [08:59:28] !log fabfur@cumin1001 START - Cookbook sre.cdn.run-puppet-restart-varnish rolling custom on A:cp-upload_eqiad and A:cp [09:00:16] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [09:06:06] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [09:07:05] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:07:24] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [09:08:53] (03PS1) 10Jelto: gitlab: remove duplicate / in redirect_uri [puppet] - 10https://gerrit.wikimedia.org/r/927988 
(https://phabricator.wikimedia.org/T320390) [09:10:34] (03PS2) 10Jelto: gitlab: remove duplicate / in redirect_uri [puppet] - 10https://gerrit.wikimedia.org/r/927988 (https://phabricator.wikimedia.org/T320390) [09:11:38] (03CR) 10Jbond: Create cookbook to upgrade Apache Traffic Server (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [09:13:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T336886)', diff saved to https://phabricator.wikimedia.org/P49058 and previous config saved to /var/cache/conftool/dbconfig/20230607-091341-ladsgroup.json [09:13:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [09:13:45] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:13:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [09:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T336886)', diff saved to https://phabricator.wikimedia.org/P49059 and previous config saved to /var/cache/conftool/dbconfig/20230607-091402-ladsgroup.json [09:16:50] (03PS1) 10Vgutierrez: fifo_log_demux: Fix systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) [09:17:28] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [09:17:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T336886)', diff saved to https://phabricator.wikimedia.org/P49060 and previous config saved to /var/cache/conftool/dbconfig/20230607-091728-ladsgroup.json [09:21:05] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [09:21:39] (03PS1) 10Slyngshede: Static assets: Do not hotlink to commons. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/927991 [09:23:57] (03CR) 10Vgutierrez: "This is currently impacting ncredir metrics / https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [09:24:27] (03CR) 10Elukey: [C: 03+2] varnishkafka: add catch all systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/924506 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [09:25:33] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:26:38] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q4), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) I've bumped `cadvisor` rollout to 20% in codfw/eqiad, for a total of ~900 hosts... 
[09:29:17] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:51] PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:31:36] (03PS1) 10Elukey: Revert "varnishkafka: add catch all systemd unit" [puppet] - 10https://gerrit.wikimedia.org/r/927698 [09:32:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P49061 and previous config saved to /var/cache/conftool/dbconfig/20230607-093234-ladsgroup.json [09:33:00] (03CR) 10Elukey: [C: 03+2] Revert "varnishkafka: add catch all systemd unit" [puppet] - 10https://gerrit.wikimedia.org/r/927698 (owner: 10Elukey) [09:33:12] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir [09:34:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41595/console" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927789 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [09:35:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolserver_legacy: Remove exim4 service [puppet] - 10https://gerrit.wikimedia.org/r/916877 (https://phabricator.wikimedia.org/T136225) (owner: 10Majavah) [09:36:10] (03Abandoned) 10David Caro: 
wmcs.instance: pin ruby2.5 [puppet] - 10https://gerrit.wikimedia.org/r/927971 (https://phabricator.wikimedia.org/T338294) (owner: 10David Caro) [09:40:47] (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "openstack: rabbitmq: simplify cloud-private-subnet firewalling support"" [puppet] - 10https://gerrit.wikimedia.org/r/927699 (https://phabricator.wikimedia.org/T338125) [09:44:07] (03CR) 10Jbond: [V: 03+1] "Sorry for the double post seems pcc posted my drafts (TIL). Thanks for working on this i have tried before and never managed to get it ov" [puppet] - 10https://gerrit.wikimedia.org/r/927789 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [09:45:24] (03CR) 10Vgutierrez: "as Fabrizio brought to my attention the man page (systemd.unit) states that RequiredBy= are used in [Init] and [Install] but we are gettin" [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [09:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P49062 and previous config saved to /var/cache/conftool/dbconfig/20230607-094740-ladsgroup.json [09:50:53] (03CR) 10Btullis: Update the maintain-views script to improve the table selection option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927723 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [09:52:27] (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "openstack: rabbitmq: simplify cloud-private-subnet firewalling support"" [puppet] - 10https://gerrit.wikimedia.org/r/927699 (https://phabricator.wikimedia.org/T338125) [09:55:51] RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:56:23] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "This PCC is more 
in line with what we are looking for:" [puppet] - 10https://gerrit.wikimedia.org/r/927699 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [10:00:05] RECOVERY - PHP opcache health on mw1439 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1000) [10:02:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T336886)', diff saved to https://phabricator.wikimedia.org/P49063 and previous config saved to /var/cache/conftool/dbconfig/20230607-100247-ladsgroup.json [10:02:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [10:02:51] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:03:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [10:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T336886)', diff saved to https://phabricator.wikimedia.org/P49064 and previous config saved to /var/cache/conftool/dbconfig/20230607-100307-ladsgroup.json [10:05:23] (03PS1) 10Clément Goubert: envoy: Add connection tracking to drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) [10:05:26] (03PS1) 10Jelto: admin: add all miscweb domains as extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) [10:06:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T336886)', diff saved to https://phabricator.wikimedia.org/P49065 and previous config saved to 
/var/cache/conftool/dbconfig/20230607-100635-ladsgroup.json [10:06:42] (03PS2) 10Clément Goubert: envoy: Add connection tracking to drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) [10:10:06] (03CR) 10Jelto: "Hi Daniel, can you check if the list of services makes sense for the Kubernetes migration? I commented in-line for services where I'm not " [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [10:13:53] (03Abandoned) 10D3r1ck01: Conversion: Fix regex for body extraction of HTML [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927700 (https://phabricator.wikimedia.org/T338264) (owner: 10D3r1ck01) [10:14:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927988 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [10:19:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41598/console" [puppet] - 10https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [10:19:27] (03PS1) 10D3r1ck01: Enable 'single-line' mode in preg_match for wikitextToHTML regex [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927702 (https://phabricator.wikimedia.org/T338264) [10:21:04] (03PS2) 10D3r1ck01: Enable 'multi-line' mode in preg_match() for wikitextToHTML regex [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927702 (https://phabricator.wikimedia.org/T338264) [10:21:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P49066 and previous config saved to /var/cache/conftool/dbconfig/20230607-102141-ladsgroup.json [10:22:46] (03PS1) 10Clément Goubert: mediawiki: Graceful termination [deployment-charts] - 
10https://gerrit.wikimedia.org/r/927999 (https://phabricator.wikimedia.org/T331609) [10:23:13] (03CR) 10Jbond: [V: 03+1] DO NOT MERGE: apply profile::apt in separate stage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927789 (https://phabricator.wikimedia.org/T338279) (owner: 10JHathaway) [10:25:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/927991 (owner: 10Slyngshede) [10:31:07] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Static assets: Do not hotlink to commons. [software/bitu] - 10https://gerrit.wikimedia.org/r/927991 (owner: 10Slyngshede) [10:32:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P49068 and previous config saved to /var/cache/conftool/dbconfig/20230607-103648-ladsgroup.json [10:37:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:38:43] (03CR) 10Muehlenhoff: Error message: Add custom error messages for 403 and 500. 
(1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/927659 (owner: Slyngshede) [10:40:46] (PS6) Jbond: wmflib: update dump_params and add filter_params [puppet] - https://gerrit.wikimedia.org/r/927613 [10:43:07] (CR) CI reject: [V: -1] wmflib: update dump_params and add filter_params [puppet] - https://gerrit.wikimedia.org/r/927613 (owner: Jbond) [10:46:09] (CR) EoghanGaffney: [C: +2] releases: clone repos/releng/release from gitlab [puppet] - https://gerrit.wikimedia.org/r/925033 (https://phabricator.wikimedia.org/T290260) (owner: Reedy) [10:46:31] (CR) Ladsgroup: [C: +1] "LGTM, yeah I agree it's a band-aid" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: Ladsgroup) [10:47:06] (CR) Ladsgroup: [C: +1] Poolcounter.release: don't reconnect if the stream is lost [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/927978 (https://phabricator.wikimedia.org/T337649) (owner: Giuseppe Lavagetto) [10:47:28] (CR) Ladsgroup: [C: +1] Also add Poolcounter.release() to on_finish [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/927979 (https://phabricator.wikimedia.org/T337649) (owner: Giuseppe Lavagetto) [10:47:57] (PS7) Jbond: wmflib: update dump_params and add filter_params [puppet] - https://gerrit.wikimedia.org/r/927613 [10:49:19] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41600/console" [puppet] - https://gerrit.wikimedia.org/r/927613 (owner: Jbond) [10:51:50] (CR) Muehlenhoff: Signup: Add email validator for signup.
(032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/927968 (owner: 10Slyngshede) [10:51:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T336886)', diff saved to https://phabricator.wikimedia.org/P49069 and previous config saved to /var/cache/conftool/dbconfig/20230607-105154-ladsgroup.json [10:51:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [10:51:58] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:52:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [10:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T336886)', diff saved to https://phabricator.wikimedia.org/P49070 and previous config saved to /var/cache/conftool/dbconfig/20230607-105215-ladsgroup.json [10:55:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T336886)', diff saved to https://phabricator.wikimedia.org/P49071 and previous config saved to /var/cache/conftool/dbconfig/20230607-105541-ladsgroup.json [10:55:43] RECOVERY - Check systemd state on analytics1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:22] (03CR) 10Abijeet Patro: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927701 (owner: 10Abijeet Patro) [11:01:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (10ayounsi) Yes, that would be possible even though there is no documented way on how to do this and what is supported or not. The two main options I see is either via a... 
[11:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:05:33] (03CR) 10Jbond: [V: 03+1] "ok pcc is noop now" [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [11:06:49] (03PS8) 10Jbond: wmflib: update dump_params and add filter_params [puppet] - 10https://gerrit.wikimedia.org/r/927613 [11:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:09:02] (03CR) 10Btullis: [C: 03+2] "I've run a successful dry-run test of the updated script: https://phabricator.wikimedia.org/T315426#8908931" [puppet] - 10https://gerrit.wikimedia.org/r/927723 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [11:09:17] (03PS3) 10Clément Goubert: envoy: Add connection tracking to drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) [11:09:25] (03CR) 10Btullis: [C: 03+2] Update the maintain-views script to improve the table selection option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927723 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [11:10:34] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10cmooney) Hi @KCVelaga sorry to hear you're having problems. I double-checked on stat1004 and it does show you belonging to the correct group, and... 
[11:10:45] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: Graceful termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/927999 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [11:10:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P49072 and previous config saved to /var/cache/conftool/dbconfig/20230607-111047-ladsgroup.json [11:11:32] (03PS2) 10FNegri: mariadb: toolsdb: move default-character-set under mysql [puppet] - 10https://gerrit.wikimedia.org/r/926518 (https://phabricator.wikimedia.org/T338307) (owner: 10Majavah) [11:15:17] (03CR) 10Effie Mouzeli: [C: 03+1] envoy: Add connection tracking to drain-envoy.sh (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [11:17:15] (03PS4) 10Clément Goubert: envoy: Add connection tracking to drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) [11:17:15] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetmaster1005 [11:17:51] (03CR) 10Clément Goubert: envoy: Add connection tracking to drain-envoy.sh (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [11:18:13] (03CR) 10Nikerabbit: [C: 03+1] Add channel for TtmServerMessageUpdate of Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927701 (owner: 10Abijeet Patro) [11:19:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10jbond) In relation to this i plan to rename theses to use puppetserverNNNN. I'm happy to do this renaming once the server is handed over. however i wanted to p... 
[11:20:07] ACKNOWLEDGEMENT - SSH on an-worker1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Btullis T338310 Wont power back on. Have raised DC-ops hardware ticket. https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:20:07] ACKNOWLEDGEMENT - Host an-worker1125 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T338310 Wont power back on. Have raised DC-ops hardware ticket. [11:20:17] (03PS2) 10Slyngshede: Error message: Add custom error messages for 403 and 500. [software/bitu] - 10https://gerrit.wikimedia.org/r/927659 [11:20:35] (03PS2) 10Clément Goubert: mediawiki: Graceful termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/927999 (https://phabricator.wikimedia.org/T331609) [11:20:45] (03CR) 10Slyngshede: Error message: Add custom error messages for 403 and 500. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/927659 (owner: 10Slyngshede) [11:20:56] (03PS1) 10Jbond: site.pp: drop pupetmasters[12]005 [puppet] - 10https://gerrit.wikimedia.org/r/928005 (https://phabricator.wikimedia.org/T330490) [11:22:05] (03CR) 10Jbond: [C: 03+2] site.pp: drop pupetmasters[12]005 [puppet] - 10https://gerrit.wikimedia.org/r/928005 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:22:21] !log jbond@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster1005 [11:23:29] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts puppetmaster1005 [11:24:07] go [11:24:11] !log jbond@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster2005 [11:24:37] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [11:25:52] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: Graceful termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/927999 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [11:25:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to 
https://phabricator.wikimedia.org/P49073 and previous config saved to /var/cache/conftool/dbconfig/20230607-112553-ladsgroup.json [11:26:20] (03CR) 10Effie Mouzeli: [C: 03+1] envoy: Add connection tracking to drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [11:26:38] (03PS2) 10Slyngshede: Signup: Add email validator for signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/927968 [11:27:19] (03PS1) 10Jbond: site.pp: Add definition for puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/928007 (https://phabricator.wikimedia.org/T330490) [11:27:20] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1005 decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [11:27:23] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] envoy: Add connection tracking to drain-envoy.sh [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/927997 (https://phabricator.wikimedia.org/T338014) (owner: 10Clément Goubert) [11:28:29] (03CR) 10Slyngshede: Signup: Add email validator for signup. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/927968 (owner: 10Slyngshede) [11:28:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Kimberly Sarabia - https://phabricator.wikimedia.org/T332042 (10MoritzMuehlenhoff) 05Open→03Resolved The duplicate has been sorted out. 
[11:29:26] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [11:30:38] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1005 decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [11:30:38] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:30:39] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1005 [11:30:40] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:30:41] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster2005 [11:30:50] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetmaster1005` - puppetmaster1005 (**WARN**) - Downtimed host on Icinga... [11:30:59] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin2002 for hosts: `puppetmaster2005` - puppetmaster2005 (**WARN**) - Downtimed host on Icinga...
[11:31:14] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Graceful termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/927999 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [11:32:29] (03Merged) 10jenkins-bot: mediawiki: Graceful termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/927999 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [11:33:23] (03CR) 10Jelto: [C: 03+2] gitlab: remove duplicate / in redirect_uri [puppet] - 10https://gerrit.wikimedia.org/r/927988 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [11:33:48] jouncebot: nowandnext [11:33:48] No deployments scheduled for the next 1 hour(s) and 26 minute(s) [11:33:48] In 1 hour(s) and 26 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1300) [11:35:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:35:04] (03CR) 10FNegri: [C: 03+2] mariadb: toolsdb: move default-character-set under mysql [puppet] - 10https://gerrit.wikimedia.org/r/926518 (https://phabricator.wikimedia.org/T338307) (owner: 10Majavah) [11:35:45] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:39:23] (03PS1) 10Arturo Borrero Gonzalez: openstack: rabbitmq: add rule for cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/928009 (https://phabricator.wikimedia.org/T338125) [11:40:57] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [11:41:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T336886)', diff saved to https://phabricator.wikimedia.org/P49074 and previous config saved to /var/cache/conftool/dbconfig/20230607-114059-ladsgroup.json [11:41:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:41:02] (03CR) 10Jbond: [C: 03+2] site.pp: Add definition for 
puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/928007 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:41:03] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:41:12] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [11:41:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:41:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T336886)', diff saved to https://phabricator.wikimedia.org/P49075 and previous config saved to /var/cache/conftool/dbconfig/20230607-114120-ladsgroup.json [11:43:08] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [11:43:59] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver1001 [11:44:16] !log jbond@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host puppetserver1001 [11:44:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T336886)', diff saved to https://phabricator.wikimedia.org/P49076 and previous config saved to /var/cache/conftool/dbconfig/20230607-114444-ladsgroup.json [11:45:09] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster1005 -> puppetserver1001 - jbond@cumin1001" [11:45:28] (03Abandoned) 10Arturo Borrero Gonzalez: Revert "Revert "openstack: rabbitmq: simplify cloud-private-subnet firewalling support"" [puppet] - 10https://gerrit.wikimedia.org/r/927699 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [11:46:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster1005 -> puppetserver1001 - jbond@cumin1001" 
[11:46:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:25] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver2001 [11:46:53] (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/928009 (https://phabricator.wikimedia.org/T338125) (owner: Arturo Borrero Gonzalez) [11:47:22] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: rabbitmq: add rule for cloud-private [puppet] - https://gerrit.wikimedia.org/r/928009 (https://phabricator.wikimedia.org/T338125) (owner: Arturo Borrero Gonzalez) [11:48:05] !log jbond@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host puppetserver2001 [11:49:51] (CR) Cathal Mooney: [C: +1] "Sorry for the delay on this one! Good stuff." [puppet] - https://gerrit.wikimedia.org/r/899516 (https://phabricator.wikimedia.org/T331707) (owner: Ayounsi) [11:51:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance [11:51:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance [11:51:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T336886)', diff saved to https://phabricator.wikimedia.org/P49077 and previous config saved to /var/cache/conftool/dbconfig/20230607-115124-ladsgroup.json [11:51:27] (CR) Cathal Mooney: [C: +1] "LGTM."
[homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:51:28] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:51:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:51:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:51:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T336886)', diff saved to https://phabricator.wikimedia.org/P49078 and previous config saved to /var/cache/conftool/dbconfig/20230607-115156-ladsgroup.json [11:52:11] (03Abandoned) 10Cathal Mooney: Change check_eth script to work without filter on netdev names [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney) [11:52:54] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/output/928009/41601/" [puppet] - 10https://gerrit.wikimedia.org/r/928009 (https://phabricator.wikimedia.org/T338125) (owner: 10Arturo Borrero Gonzalez) [11:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T336886)', diff saved to https://phabricator.wikimedia.org/P49079 and previous config saved to /var/cache/conftool/dbconfig/20230607-115408-ladsgroup.json [11:56:42] (03CR) 10EoghanGaffney: [C: 03+2] releases: Add new hosts to failover servers list [puppet] - 10https://gerrit.wikimedia.org/r/924970 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [11:59:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P49080 and previous config saved to /var/cache/conftool/dbconfig/20230607-115950-ladsgroup.json [12:01:31] (03PS1) 10KartikMistry: 
testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928013 (https://phabricator.wikimedia.org/T337834) [12:02:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/927968 (owner: 10Slyngshede) [12:04:51] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver2001 [12:06:01] !log jbond@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver2001 [12:06:07] !log jbond@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver1001 [12:06:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/927659 (owner: 10Slyngshede) [12:07:02] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Error message: Add custom error messages for 403 and 500. [software/bitu] - 10https://gerrit.wikimedia.org/r/927659 (owner: 10Slyngshede) [12:07:26] !log jbond@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver1001 [12:07:54] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [12:09:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P49081 and previous config saved to /var/cache/conftool/dbconfig/20230607-120914-ladsgroup.json [12:09:55] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster1005 -> puppetserver1001 - jbond@cumin1001" [12:10:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: rename puppetmaster1005 -> puppetserver1001 - jbond@cumin1001" [12:10:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:36] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache 
puppetserver.eqiad.wmnet on all recursors [12:11:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetserver.eqiad.wmnet on all recursors [12:12:12] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup: Add email validator for signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/927968 (owner: 10Slyngshede) [12:12:37] (03PS1) 10Jelto: idp: add gitlab-replicas to gitlab_oidc config [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) [12:12:42] PROBLEM - Check systemd state on releases2003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases2003.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:44] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetserver1001.eqiad.wmnet on all recursors [12:12:47] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetserver1001.eqiad.wmnet on all recursors [12:13:35] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetserver1001.eqiad.wmnet with OS bookworm [12:13:38] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10serviceops-collab, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) After fixing the `redirect_uri` I'm able to login successfully to the admin interface (https://gitlab.wikimedia.org/admin) using... 
[12:13:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm [12:14:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P49082 and previous config saved to /var/cache/conftool/dbconfig/20230607-121456-ladsgroup.json [12:15:10] RECOVERY - Check systemd state on releases2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:39] (03CR) 10Jbond: idp: add gitlab-replicas to gitlab_oidc config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:20:48] (03PS1) 10Btullis: Update the cumin aliases for the wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) [12:23:10] PROBLEM - Check systemd state on releases2003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases2003.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:46] PROBLEM - Check systemd state on releases1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-srv-patches-releases1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:50] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: cloudnet: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/928023 (https://phabricator.wikimedia.org/T338314) [12:24:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P49083 and previous config saved to /var/cache/conftool/dbconfig/20230607-122420-ladsgroup.json 
[12:26:40] RECOVERY - Check systemd state on releases1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:15] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (10ayounsi) [12:27:30] RECOVERY - Check systemd state on releases2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:17] (03CR) 10Jelto: idp: add gitlab-replicas to gitlab_oidc config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928017 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [12:29:02] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/928023 (https://phabricator.wikimedia.org/T338314) (owner: 10Arturo Borrero Gonzalez) [12:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T336886)', diff saved to https://phabricator.wikimedia.org/P49084 and previous config saved to /var/cache/conftool/dbconfig/20230607-123002-ladsgroup.json [12:30:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:31:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: cloudnet: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/928023 (https://phabricator.wikimedia.org/T338314) (owner: 10Arturo Borrero Gonzalez) [12:33:49] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [12:36:13] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet - aborrero@cumin2002" [12:37:19] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet - aborrero@cumin2002" [12:37:19] !log aborrero@cumin2002 
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:37:20] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10Andrew) 05Open→03Resolved [12:39:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T336886)', diff saved to https://phabricator.wikimedia.org/P49085 and previous config saved to /var/cache/conftool/dbconfig/20230607-123926-ladsgroup.json [12:39:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:39:30] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:39:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:41:08] (03CR) 10Ladsgroup: [C: 03+2] poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [12:41:12] (03CR) 10Ladsgroup: [C: 03+2] Poolcounter.release: don't reconnect if the stream is lost [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927978 (https://phabricator.wikimedia.org/T337649) (owner: 10Giuseppe Lavagetto) [12:41:17] (03CR) 10Ladsgroup: [C: 03+2] Also add Poolcounter.release() to on_finish [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927979 (https://phabricator.wikimedia.org/T337649) (owner: 10Giuseppe Lavagetto) [12:41:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable cache warming jobs for parsoid per default. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/927758 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [12:43:31] (03CR) 10Andrew Bogott: apt::repository: remove conflicting .list files from bookworm /etc/apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [12:45:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:45:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:45:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49086 and previous config saved to /var/cache/conftool/dbconfig/20230607-124543-ladsgroup.json [12:45:44] (03Merged) 10jenkins-bot: poolcounter: Make it release before closing connection [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927767 (https://phabricator.wikimedia.org/T337649) (owner: 10Ladsgroup) [12:45:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:46:43] !log mwscript maintenance/storage/moveToExternal.php --iconv DB cluster27 on dawiktionary and svwiktionary (T128155 and T128156) [12:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:47] T128155: Migrate all old DB rows from windows-1252 to UTF-8 on dawiktionary - https://phabricator.wikimedia.org/T128155 [12:46:47] T128156: Migrate all old DB rows from windows-1252 to UTF-8 on svwiktionary - https://phabricator.wikimedia.org/T128156 [12:47:00] (03Merged) 10jenkins-bot: Poolcounter.release: don't reconnect if the stream is lost [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927978 (https://phabricator.wikimedia.org/T337649) (owner: 10Giuseppe Lavagetto) [12:47:02] (03Merged) 
10jenkins-bot: Also add Poolcounter.release() to on_finish [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/927979 (https://phabricator.wikimedia.org/T337649) (owner: 10Giuseppe Lavagetto) [12:49:50] jouncebot: nowandnext [12:49:50] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [12:49:50] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1300) [12:50:13] !log cmooney@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 [12:50:13] (03PS1) 10Ladsgroup: Revert "Revert "Remove legacy encoding option from dawiktionary"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928041 [12:50:16] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [12:50:23] (03PS2) 10Ladsgroup: Revert "Revert "Remove legacy encoding option from dawiktionary"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928041 [12:50:27] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Remove legacy encoding option from dawiktionary"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928041 (owner: 10Ladsgroup) [12:50:40] !log Depooling lvs1019 to move link from lsw1-f1-eqiad to ssw1-f1-eqiad [12:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:00] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver1001.eqiad.wmnet with OS bookworm [12:51:10] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm executed with errors: - puppet... 
[12:51:29] (03Merged) 10jenkins-bot: Revert "Revert "Remove legacy encoding option from dawiktionary"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928041 (owner: 10Ladsgroup) [12:51:41] (03CR) 10Herron: [C: 03+1] opensearch: disable security plugin on codfw [puppet] - 10https://gerrit.wikimedia.org/r/927771 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [12:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T336886)', diff saved to https://phabricator.wikimedia.org/P49087 and previous config saved to /var/cache/conftool/dbconfig/20230607-125145-ladsgroup.json [12:51:49] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:52:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928041 (owner: 10Ladsgroup) [12:52:15] (03CR) 10Herron: [C: 03+1] opensearch: clean up hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/927769 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [12:52:55] (03CR) 10Herron: [C: 03+2] aptrepo: add logrotate bullseye component [puppet] - 10https://gerrit.wikimedia.org/r/927745 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [12:53:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49088 and previous config saved to /var/cache/conftool/dbconfig/20230607-125335-ladsgroup.json [12:54:55] topranks: please ping me once you're done [12:55:09] Amir1: yep will do [12:55:15] won't be long [12:55:38] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [12:55:46] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args 
/usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:55:49] ^^^ this is my work [12:57:36] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1300). [13:00:04] duesen, xSavitar, and Richika R: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:12] o/ [13:00:13] o/ [13:00:26] * Lucas_WMDE looks at jouncebot anyways [13:00:31] all yours [13:00:34] xSavitar: your patch has not been merged into master? [13:00:36] Lucas_WMDE: but you signed up for it! [13:01:13] taavi, yes! We want to make sure it works on .12 before we move it forward as it will ride the train to group 1 today soon [13:01:36] But I tested this locally [13:01:48] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:01:56] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:01:59] !log cmooney@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 (duration: 11m 45s) [13:02:02] Amir1: that's me done now [13:02:02] So you can go ahead and I'm here to test, thanks! 
[13:02:03] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [13:02:12] awesome [13:02:14] (03CR) 10Klausman: [C: 03+1] ml-services: deploy bloom-3b with AMD GPU support [deployment-charts] - 10https://gerrit.wikimedia.org/r/927620 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [13:02:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [13:02:23] I notice /srv/mediawiki-staging on deploy1002 is one behind upstream [13:02:28] (one *commit) [13:02:30] yeah, that's me [13:02:37] ok [13:02:39] and on top, we had LVS maint right now [13:02:47] I'll be quick [13:02:48] so we shouldn’t deploy just yet? [13:03:07] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:928041|Revert "Revert "Remove legacy encoding option from dawiktionary""]] [13:03:08] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 80 connections established with conf1007.eqiad.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [13:03:15] it was finished like a minute ago. Look at top*ranks message [13:03:23] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye [13:03:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye [13:04:39] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:928041|Revert "Revert "Remove legacy encoding option from dawiktionary""]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:04:51] taavi or Lucas_WMDE, if everything is clear, can you begin with the patch I scheduled? 
Daniel will be a little late. [13:06:12] xSavitar: is rrana-wmf on IRC? [13:06:24] I’d like to have some confirmation that the +1ed change is okay to deploy in production already [13:06:32] (03CR) 10Btullis: Update the cumin aliases for the wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [13:06:35] Lucas_WMDE, no but I'm on a call with them now and they're seeing these chats. [13:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P49089 and previous config saved to /var/cache/conftool/dbconfig/20230607-130651-ladsgroup.json [13:06:56] Lucas_WMDE, let me do that. [13:07:00] per https://wikitech.wikimedia.org/wiki/Backport_windows all patches being backported should be merged to master and tested on beta first, which makes me very reluctant to deploy that [13:07:13] (03CR) 10D3r1ck01: [C: 03+1] "LGTM! Please go ahead!" [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927702 (https://phabricator.wikimedia.org/T338264) (owner: 10D3r1ck01) [13:07:36] also please get everyone involved in the deployment on this IRC channel [13:07:52] taavi, Okay, let me hit +2 on that, np [13:08:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P49090 and previous config saved to /var/cache/conftool/dbconfig/20230607-130841-ladsgroup.json [13:09:09] taavi, Richika doesn't have an IRC account setup as at now. So I'm showing them to get this patch up. Is that going to be a blocker? 
[13:09:46] *shadowing [13:10:18] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:928041|Revert "Revert "Remove legacy encoding option from dawiktionary""]] (duration: 07m 11s) [13:10:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:10:41] (03PS2) 10Btullis: Update the cumin aliases for the wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) [13:11:10] the error is reproducible on beta (https://wikidata.beta.wmflabs.org/wiki/Wikidata_talk:Main_Page), so I’d prefer to wait for the master change to merge and then check it there before deploying the backport [13:11:14] we should have enough time in the window, I think [13:11:27] I'm done [13:11:32] thanks [13:11:35] Lucas_WMDE, sure. I'll verify that, thanks! [13:12:40] in the meantime, I guess we wait whether duesen or the Flow gate-and-submit is faster ;) [13:12:59] sure [13:15:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:07] (03PS1) 10Slyngshede: C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 [13:18:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [13:18:34] Lucas_WMDE, it seems Daniel's patch is not yet +2'd [13:18:40] (03CR) 10CI reject: [V: 04-1] C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [13:18:43] I saw you only +1'd it [13:18:54] config changes only get +2ed when they’re actually deployed [13:19:06] taavi or I will probably do that [13:19:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [13:19:28] Cool. I was saying so in reference to the message you said above about the "gate-and-submit is faster" [13:19:48] (the same goes for backports, actually, but in order for the deployer to +2 the backport, we generally like to see a +2 on the master version of the change first – whereas for config changes there’s no separate master branch like that) [13:19:59] (03CR) 10Btullis: Update the cumin aliases for the wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [13:19:59] I meant the gate-and-submit of the Flow master change, so we can test it on beta [13:20:01] (03CR) 10Btullis: [C: 03+2] Update the cumin aliases for the wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [13:20:18] (03PS2) 10Slyngshede: C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 [13:20:18] Understood now. 
[13:20:21] !log removing remote vlan configuration from lsw1-f1-eqiad T322937 [13:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [13:20:28] jouncebot: next [13:20:28] In 3 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1700) [13:20:34] ok, we have some time after the window [13:20:41] (03CR) 10CI reject: [V: 04-1] C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [13:20:53] if Flow CI takes so long, the backport gate-and-submit might take too long for the window, but I think it’s fine if we overshoot a bit [13:20:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1027.mgmt.eqiad.wmnet with reboot policy FORCED [13:21:10] (alternatively, we could start the build now, with the option to abort it if the change doesn’t work on beta after all) [13:21:43] Lucas_WMDE, actually, thanks a lot for linking me to the Flow board on wikidata beta. I actually looked for something like that on enwiki beta but didn't find it. [13:21:43] (03PS3) 10Slyngshede: C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 [13:21:54] I would have +2'd the patch to go to beta first before the deploy window. [13:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P49092 and previous config saved to /var/cache/conftool/dbconfig/20230607-132158-ladsgroup.json [13:22:04] I just got lucky I guess, it was the first one I checked ^^ [13:22:21] So I got stuck on seeing it work on mediawikiwiki before pushing +2 on master (which is the reverse of what we normally do) [13:22:36] Lucas_WMDE, oh, not luck. 
Wikidata is your area :D [13:23:41] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41602/console" [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [13:23:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P49093 and previous config saved to /var/cache/conftool/dbconfig/20230607-132348-ladsgroup.json [13:25:56] Lucas_WMDE, remind me how long it takes for beta to pick up changes from master. Is it 10mins or has it gotten less these days? [13:26:05] up to 10mins I think [13:26:12] Okay [13:26:19] (03PS1) 10Krinkle: webperf: Fix /excimer/ POST restriction in prod too [puppet] - 10https://gerrit.wikimedia.org/r/928053 [13:26:50] (03CR) 10Herron: [C: 03+2] mwlog: upgrade logrotate and use ignoreduplicates [puppet] - 10https://gerrit.wikimedia.org/r/927746 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [13:26:51] the timers run every 10 minutes, so something like 15 minutes max if you take the runtime into account [13:27:00] you can see the jobs at https://integration.wikimedia.org/ci/view/Beta/ – the last update started 4 minutes ago so the next one will be 6 minutes from now I believe [13:27:02] (03Abandoned) 10Effie Mouzeli: kubernetes.yaml: add iPoid user/tokens [labs/private] - 10https://gerrit.wikimedia.org/r/922808 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [13:27:25] taavi, thanks! 
[13:27:42] (03CR) 10CDanis: [C: 03+2] Increase NetworkProbeLimit 10x [puppet] - 10https://gerrit.wikimedia.org/r/927809 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [13:28:35] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1024.mgmt.eqiad.wmnet with reboot policy FORCED [13:29:48] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10mpopov) Hiya! Sure thing: ` $ sudo -u analytics-product kerberos-run-command analytics-product hdfs dfs -ls / Found 5 items drwxr-xr-x - hdfs ha... [13:29:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1027.eqiad.wmnet with OS bullseye [13:29:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1024.mgmt.eqiad.wmnet with reboot policy FORCED [13:30:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye [13:30:17] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: Fix /excimer/ POST restriction in prod too [puppet] - 10https://gerrit.wikimedia.org/r/928053 (owner: 10Krinkle) [13:30:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye [13:31:05] Lucas_WMDE, xSavitar: can we deploy the config patch while we are waiting for your patch to merge? [13:31:23] absolutely. hi! [13:31:38] taavi: want to deploy or should I go ahead? [13:31:43] (03PS2) 10Lucas Werkmeister (WMDE): Enable cache warming jobs for parsoid per default. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927758 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:31:50] Lucas_WMDE: go ahead [13:32:01] ok! 
[13:32:09] duesen, yes I'll confirm with Lucas_WMDE when things work on beta shortly. [13:32:30] Lucas_WMDE: this will put significant load on jobrunners. There is a 20% chance that we need to revert. [13:32:37] duesen: I just noticed in the diffConfig, the StashDuration of a bunch of beta wikis changed https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/3075/console [13:32:46] in addition to the WarmParsoidParserCache: true in production wikis [13:32:59] any idea if that’s fine? [13:33:05] (probably doesn’t need to block this, just checking) [13:33:48] Lucas_WMDE: this is not intentional... maybe a rebase issue? [13:34:00] I was looking at the diffConfig of PS1 [13:34:05] (PS2 hadn’t finished yet) [13:34:11] probably some nuance of how the config overrides each other [13:34:52] oh, beta only overrides 'default' [13:35:14] so I guess previously, beta group0/small/etc. got the production 'group0' entry instead of the IS-labs 'default' entry? [13:35:15] idk [13:35:58] I’ll start the backport [13:36:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927758 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:36:09] Lucas_WMDE: yes, the stash duration for beta isn't really relevant [13:36:25] my interpretation is that all beta wikis were supposed to have 2h, and it wasn’t intentional that some of them had 24h after all, so this kinda fixes it [13:36:41] right [13:36:53] (03Merged) 10jenkins-bot: Enable cache warming jobs for parsoid per default. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/927758 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [13:37:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T336886)', diff saved to https://phabricator.wikimedia.org/P49095 and previous config saved to /var/cache/conftool/dbconfig/20230607-133704-ladsgroup.json [13:37:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:37:08] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:37:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:37:20] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:927758|Enable cache warming jobs for parsoid per default. (T329366)]] [13:37:23] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [13:37:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T336886)', diff saved to https://phabricator.wikimedia.org/P49096 and previous config saved to /var/cache/conftool/dbconfig/20230607-133725-ladsgroup.json [13:37:49] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10cmooney) Thanks @mpopov. Comparing you two I see you are additionally a member of '//analytics-search-users//', I'm not sure if perhaps that is re... [13:38:48] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1027.eqiad.wmnet'] [13:38:55] !log lucaswerkmeister-wmde@deploy1002 daniel and lucaswerkmeister-wmde: Backport for [[gerrit:927758|Enable cache warming jobs for parsoid per default. 
(T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [13:38:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49097 and previous config saved to /var/cache/conftool/dbconfig/20230607-133854-ladsgroup.json [13:38:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:39:04] duesen: is anything about this testable on mwdebug or do we just roll it out and then watch how the load develops? [13:39:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:39:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:39:21] (03Abandoned) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [13:39:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:39:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['dbproxy1027.eqiad.wmnet'] [13:39:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T336886)', diff saved to https://phabricator.wikimedia.org/P49098 and previous config saved to /var/cache/conftool/dbconfig/20230607-133933-ladsgroup.json [13:39:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbproxy1027.eqiad.wmnet'] [13:40:20] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=1) upgrade firmware for hosts ['dbproxy1027.eqiad.wmnet'] [13:41:32] duesen: are you testing on mwdebug? [13:42:12] Lucas_WMDE: this affects jobrunners only, there is nothing to test on mwdebug [13:42:16] ok, then I’ll sync [13:42:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T336886)', diff saved to https://phabricator.wikimedia.org/P49099 and previous config saved to /var/cache/conftool/dbconfig/20230607-134218-ladsgroup.json [13:42:22] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:42:43] Lucas_WMDE: it's already live for most wikis anyway. The only question is whether this puts too much load on jobrunners. [13:42:48] *nods* [13:44:22] xSavitar: apparently https://wikidata.beta.wmflabs.org/wiki/Topic:Xjl9u7ra9su73gta worked now \o/ [13:45:17] Lucas_WMDE, yes I can also confirm here: https://en.wikipedia.beta.wmflabs.org/wiki/Topic:Xjl9vhgveuhaomou and https://wikidata.beta.wmflabs.org/wiki/Topic:Xjl9va91jqfjlae6 [13:45:22] Lucas_WMDE, thanks for confirming. [13:45:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "seems to work on Beta, let’s start the gate-and-submit already" [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927702 (https://phabricator.wikimedia.org/T338264) (owner: 10D3r1ck01) [13:45:56] * Lucas_WMDE watches the VE Backend Dashboard [13:45:58] (03PS1) 10Cathal Mooney: Change hierdata parents for leaf switches eqiad row F [puppet] - 10https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937) [13:46:08] (03CR) 10JHathaway: [C: 03+1] "looks good, why did you go with wmflib::resource::filter_params rather than wmflib::class::filter_params?" 
[puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [13:46:29] the main one to watch is probably the queue wait time, which we don’t want to grow without limit [13:46:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T336886)', diff saved to https://phabricator.wikimedia.org/P49100 and previous config saved to /var/cache/conftool/dbconfig/20230607-134656-ladsgroup.json [13:46:58] looks like the enqueue rate is starting to head northwards [13:47:06] (php-fpm-restart 80% done) [13:47:11] yes. _joe_ said "call me when it hits five minutes" ;) [13:47:17] hehe [13:47:48] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:927758|Enable cache warming jobs for parsoid per default. (T329366)]] (duration: 10m 27s) [13:47:51] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [13:48:07] ok, let’s continue monitoring before I block my terminal with the Flow backport… [13:48:10] queue wait time has odd spikes, it's <1 second usually, but briefly goes up to 30 minutes every now and then. But nothing in between... [13:48:21] o_O [13:48:52] queue wait time is going up too, but only 3 s so far [13:50:44] job concurrency is up as well, but not dramatically [13:50:50] yeah [13:51:13] (03PS1) 10Muehlenhoff: Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) [13:51:34] my rough estimate was that all three metrics would roughly double. 
Let's see how that prediction holds up [13:51:36] (03CR) 10CI reject: [V: 04-1] Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [13:52:10] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10KCVelaga_WMF) I tried running again, on both stat1004 and stat1005. The issue persists. [13:52:16] enqueue rate seems to be leveling off below 200 [13:52:30] concurrency also no longer going up [13:53:04] wait time is high though. >20 sec [13:53:29] 30 now [13:53:50] jobrunners took a hit, but it's leveling off [13:55:59] the wait time graph looks like it’s tapering off, but it has a log scale so that could still be linear for all I know [13:56:04] (03PS2) 10Muehlenhoff: Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) [13:56:42] yeah with a linear scale it looks like a pretty straight slope… [13:57:21] that would indicate the queue is growing because processing is too slow. [13:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P49101 and previous config saved to /var/cache/conftool/dbconfig/20230607-135724-ladsgroup.json [13:57:29] if that is the case, we have to revert [13:58:35] xSavitar’s backport needs ~5 more minutes according to Zuul (plus then some more time to actually deploy it) [13:58:45] should we do that first and then decide about reverting the config change afterwards? [13:58:52] or should we do the revert sooner? [13:58:58] Lucas_WMDE, I'm still here for sure. 
[13:59:11] (to me it looks like it’s not super urgent even if we decide to revert after all, but idk) [13:59:58] Lucas_WMDE, 10 mins is not too bad to monitor stuff with the jobs patch (for a bit) [14:00:00] weeee https://grafana.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard?orgId=1&from=now-3h&to=now&viewPanel=31&refresh=1m [14:00:11] We can move on with the Flow one :) [14:00:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927702 (https://phabricator.wikimedia.org/T338264) (owner: 10D3r1ck01) [14:00:46] ok, here goes [14:00:49] Amir1, to the sky we all go :D [14:00:58] Amir1: i'm worried about queue wait. [14:01:00] (03Merged) 10jenkins-bot: Enable 'multi-line' mode in preg_match() for wikitextToHTML regex [extensions/Flow] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/927702 (https://phabricator.wikimedia.org/T338264) (owner: 10D3r1ck01) [14:01:04] (03PS1) 10Jbond: puppetserver: add netboot section for puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/928060 (https://phabricator.wikimedia.org/T330490) [14:01:16] xSavitar: jobrunners wouldn't like that :P [14:01:23] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:927702|Enable 'multi-line' mode in preg_match() for wikitextToHTML regex (T338264)]] [14:01:25] duesen: do you have metrics? [14:01:26] T338264: Caught exception of type Flow\Exception\DataModelException when trying to submit on MediaWiki.org - https://phabricator.wikimedia.org/T338264 [14:01:34] and did you give it concurrency? [14:01:37] Amir1: it's growing linearly. the log scale is deceiving. It's not leveling off. 
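The remark that the log scale is deceiving can be checked numerically. What follows is a minimal sketch using hypothetical wait-time samples (illustrative numbers, not read from the dashboard): a linearly growing series has constant first differences, while the differences of its logarithm shrink, which is why the curve appears to taper off on a log axis.

```python
import math

def first_differences(values):
    """Return successive differences values[i+1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical queue wait times in seconds, sampled once a minute and
# growing by a constant 10 s per minute (illustrative numbers only).
wait_seconds = [10 + 10 * minute for minute in range(12)]

linear_steps = first_differences(wait_seconds)
log_steps = first_differences([math.log10(w) for w in wait_seconds])

# Linear growth: every step has the same size.
is_linear = all(step == linear_steps[0] for step in linear_steps)

# On a log axis each step is smaller than the previous one, so the curve
# looks like it is flattening even though the wait time keeps growing.
looks_flat_on_log = all(b < a for a, b in zip(log_steps, log_steps[1:]))

print(is_linear, looks_flat_on_log)
```

Switching the panel to a linear axis, as was done in the channel, is the quicker check; the sketch only shows why the two views disagree.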
[14:01:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:01:50] (03CR) 10Jbond: [C: 03+2] puppetserver: add netboot section for puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/928060 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:01:51] let me check [14:02:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P49102 and previous config saved to /var/cache/conftool/dbconfig/20230607-140203-ladsgroup.json [14:02:05] did you give it its own lane? If so we can bump it [14:02:37] We didn't tweak the config, _joe_ said we could try it as-is. [14:02:59] !log lucaswerkmeister-wmde@deploy1002 d3r1ck01 and lucaswerkmeister-wmde: Backport for [[gerrit:927702|Enable 'multi-line' mode in preg_match() for wikitextToHTML regex (T338264)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:03:02] But this looks like we do need to give it its own lane (i have no idea how to do that) [14:03:11] Looks at the graphs at the bottom of https://grafana.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard [14:03:21] xSavitar: can you test on mwdebug? [14:03:28] Lucas_WMDE, okay, on it. 
[14:03:31] Amir1: jobrunners took a saturation hit around 1347 but it seems to have stabilized [14:03:50] cool [14:03:59] * Lucas_WMDE looks at the chart again [14:04:00] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=5&from=now-1h&to=now [14:04:03] yeah that’s still quite linear [14:04:04] (03CR) 10Volans: Update the cumin aliases for the wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [14:04:04] This is not going down [14:04:09] claime: I think *something* is saturated. They may be waiting on IO. The queue is growing, they are not processing jobs fast enough [14:04:11] let’s revert after the backport [14:04:17] yes. [14:05:01] Lucas_WMDE, it works! [14:05:08] ok, syncing [14:05:10] thanks for testing [14:05:22] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetserver1001.eqiad.wmnet with OS bookworm [14:05:23] Lucas_WMDE, thank you [14:05:45] (03CR) 10Jbond: [C: 03+1] "thanks for the review" [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [14:05:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm [14:05:55] (03CR) 10Jbond: [C: 03+2] wmflib: update dump_params and add filter_params [puppet] - 10https://gerrit.wikimedia.org/r/927613 (owner: 10Jbond) [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:48] 10SRE, 10SRE-Access-Requests: Requesting access to wmf MediaWiki history for 
Tarun Chadha - https://phabricator.wikimedia.org/T337857 (10Aklapper) 05Resolved→03Declined a:05MatthewVernon→03None [14:06:52] Lucas_WMDE: I need to head out soon, will you take care of the revert? [14:06:58] yup, can do [14:07:02] thank you! [14:07:08] I'm about to make the patch [14:07:10] for the lane [14:07:11] <_joe_> the queue is 1 minute. [14:07:21] <_joe_> I am off but please don't revert [14:07:30] <_joe_> thanks Amir1 :) [14:07:40] (03PS4) 10Slyngshede: C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 [14:07:53] now I’m getting mixed messages [14:08:05] Amir1: the lane patch would potentially address the issue and avoid the need for a revert, do I understand correctly? [14:08:12] Lucas_WMDE: yes [14:08:23] still waiting for php-fpm-restart so we have a few minutes to figure this out before I can revert anyways ;) [14:08:24] ok [14:08:35] I was about to say, I'm not seeing any obvious issues if I compare on the last 24 hours [14:08:37] (03PS1) 10Ladsgroup: changeprop-jobqueue: Give parsoidCachePrewarm its own lane [deployment-charts] - 10https://gerrit.wikimedia.org/r/928063 (https://phabricator.wikimedia.org/T320534) [14:08:51] _joe_: while you're off. 32 for all concurrency is good? 
[14:08:56] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T338236 (10Jclark-ctr) 05Open→03Resolved reseated power cord [14:09:03] judging by this https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=75 [14:09:03] queue wait time went up to an hour yesterday between 1800 and 2000 [14:09:12] We're still *far* off that [14:09:42] (PS1) Ladsgroup: changeprop-jobqueue: Give parsoidCachePrewarm its own lane [deployment-charts] - https://gerrit.wikimedia.org/r/928063 (https://phabricator.wikimedia.org/T320534) [14:09:52] ^ if anyone wants to review the numbers [14:10:03] (03PS3) 10Muehlenhoff: Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) [14:10:39] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:927702|Enable 'multi-line' mode in preg_match() for wikitextToHTML regex (T338264)]] (duration: 09m 16s) [14:10:43] T338264: Caught exception of type Flow\Exception\DataModelException when trying to submit on MediaWiki.org - https://phabricator.wikimedia.org/T338264 [14:10:57] (03PS5) 10Slyngshede: C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 [14:10:58] I actually should just not partition it [14:11:07] given that it'll be empty in wikidata and commons [14:11:18] Lucas_WMDE, seems the change is live now? [14:11:28] it should be, yeah [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:35] Thank you so much. [14:12:00] Here are my stars for you: ⭐⭐⭐⭐⭐ [14:12:01] sidekiq ? 
[14:12:03] it looks like the wait time is very slowly tapering off now (even with a linear scale) [14:12:16] what's that? [14:12:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P49103 and previous config saved to /var/cache/conftool/dbconfig/20230607-141230-ladsgroup.json [14:12:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:12:41] (03PS2) 10Ladsgroup: changeprop-jobqueue: Give parsoidCachePrewarm its own lane [deployment-charts] - 10https://gerrit.wikimedia.org/r/928063 (https://phabricator.wikimedia.org/T320534) [14:12:43] jynus: gitlab stuff I think? [14:13:08] https://wikitech.wikimedia.org/wiki/GitLab/Monitoring [14:13:11] ahm prometheus [14:13:15] I was thinking mw job [14:13:22] and got confused [14:13:27] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet2005-dev [14:13:44] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudnet2005-dev [14:13:49] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet2006-dev [14:14:01] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudnet2006-dev [14:14:04] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet2005-dev [14:14:13] Actually, I'm not very smart, that's per-pod concurrency, how many pods there are? 
[14:14:43] 10SRE, 10Maps: Allow Wikimedia Maps usage on c5.gob.pa - https://phabricator.wikimedia.org/T338069 (10Aklapper) @Pereibri: See the link previously posted by Nemo_bis [14:14:51] !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet2005-dev [14:15:11] 10SRE, 10Maps: Allow Wikimedia Maps usage on Mobile Application written with Qt - https://phabricator.wikimedia.org/T338083 (10Aklapper) See https://switch2osm.org/providers/ for the options that you have. [14:15:51] Amir1: 30 [14:16:18] !log aborrero@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet2006-dev [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:33] for CP? that's way too many [14:16:44] helmfile.d/services/changeprop-jobqueue on  T331609 [?] via ⎈ v3.9.4 took 6s [14:16:45] T331609: Gracefully handle pod termination in mw-on-k8s - https://phabricator.wikimedia.org/T331609 [14:16:46] ❯ git grep replicas [14:16:48] values-production.yaml: replicas: 30 [14:17:06] !log aborrero@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet2006-dev [14:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P49104 and previous config saved to /var/cache/conftool/dbconfig/20230607-141709-ladsgroup.json [14:17:22] (03PS1) 10Muehlenhoff: Add also stub secrets for the mediawiki key for the idm_test role [labs/private] - 10https://gerrit.wikimedia.org/r/928065 (https://phabricator.wikimedia.org/T338008) [14:17:49] Amir1: why is too many? Is there a problem with having too many pods in changeprop-jobqueue ? 
[14:18:07] standard changeprop (not jobqueue) is 12 [14:18:13] it would make sense for jobrunners themselves [14:18:44] but anyway, maybe I'm missing something. I think 30 is good enough given that other per-edit jobs have 30 too [14:19:52] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:53] like cdnpurge (that even includes refreshlinks jobs) or wikibase-addUsagesForPage (20), cirrusSearchCheckerJob (30) [14:20:48] yeah, ~30 sgtm [14:20:55] (03CR) 10Slyngshede: "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/928065 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:21:53] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add also stub secrets for the mediawiki key for the idm_test role [labs/private] - 10https://gerrit.wikimedia.org/r/928065 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:21:55] claime: do you want to review/deploy it? [14:22:03] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: Give parsoidCachePrewarm its own lane [deployment-charts] - 10https://gerrit.wikimedia.org/r/928063 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [14:22:08] Amir1: I was in the process :P [14:22:14] hahahaha [14:22:15] nice [14:22:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:22:22] we can bump it if it takes longer [14:22:50] I'll let you +2 and I can do the helmfile deploy [14:22:59] Do we need to do the pooling dance ? 
[14:23:03] I’m still not ruling out a revert, given that the time keeps going up [14:23:14] but I’ll let you try this first ^^ [14:23:23] (03CR) 10Ladsgroup: [C: 03+2] changeprop-jobqueue: Give parsoidCachePrewarm its own lane [deployment-charts] - 10https://gerrit.wikimedia.org/r/928063 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [14:23:35] claime: I don't think so, I've deployed CP before [14:23:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host dbproxy1027.eqiad.wmnet with OS bullseye [14:24:03] if the backlog grows, we can bump it to forty and if that's not enough then a revert [14:24:33] (03Merged) 10jenkins-bot: changeprop-jobqueue: Give parsoidCachePrewarm its own lane [deployment-charts] - 10https://gerrit.wikimedia.org/r/928063 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [14:24:40] btw, it now looks like the wait time isn’t really tapering off after all, it just switched from one linear growth rate to a slightly lower linear growth rate at ca. 14:04 [14:24:47] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver1001.eqiad.wmnet with OS bookworm [14:24:52] I leave the deploy to claime [14:24:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm executed with errors: - puppet... 
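The observation at 14:24:40 (a switch from one linear growth rate to a slightly lower one) is what a constant-rate backlog model would predict when processing capacity rises but still stays below the enqueue rate. A rough sketch, assuming made-up rates in jobs per minute; `backlog_over_time` is a hypothetical helper and not part of changeprop or the job queue:

```python
def backlog_over_time(enqueue_rate, process_rate, minutes, start=0.0):
    """Backlog (in jobs) at the end of each minute, for constant rates in jobs/min."""
    backlog = start
    history = []
    for _ in range(minutes):
        # The backlog changes by the difference of the two rates and cannot go negative.
        backlog = max(0.0, backlog + enqueue_rate - process_rate)
        history.append(backlog)
    return history

# Hypothetical numbers, chosen only to illustrate the shape of the curve.
before = backlog_over_time(enqueue_rate=200, process_rate=180, minutes=5)
# Slightly more processing capacity, still below the enqueue rate.
after = backlog_over_time(enqueue_rate=200, process_rate=190, minutes=5, start=before[-1])

print(before)
print(after)
```

In this model the backlog only starts to shrink once `process_rate` exceeds `enqueue_rate`, which is the condition the concurrency bump is meant to reach.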
[14:25:44] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetserver1001.eqiad.wmnet with OS bookworm [14:25:55] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm [14:26:19] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:27:09] 10ops-eqiad, 10DC-Ops: Relabel: puppetserver1005 to puppetserver1001 - https://phabricator.wikimedia.org/T338326 (10jbond) [14:27:31] 10ops-codfw, 10DC-Ops: Relabel: puppetserver1005 to puppetserver1001 - https://phabricator.wikimedia.org/T338327 (10jbond) [14:27:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T336886)', diff saved to https://phabricator.wikimedia.org/P49106 and previous config saved to /var/cache/conftool/dbconfig/20230607-142736-ladsgroup.json [14:27:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1142.eqiad.wmnet with reason: Maintenance [14:27:40] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:27:43] (03CR) 10David Caro: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [14:27:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1142.eqiad.wmnet with reason: Maintenance [14:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T336886)', diff saved to https://phabricator.wikimedia.org/P49107 and previous config saved to /var/cache/conftool/dbconfig/20230607-142756-ladsgroup.json [14:28:25] Amir1: Great, capacity issues on staging. 
[14:28:38] lol [14:28:59] can we override it in staging, it's not used much there anyway :/ [14:29:15] yeah, I can just cancel the deployment [14:29:28] (or just let it fail) [14:29:36] I hope I won't have an issue with redeploying prod though [14:29:39] (03CR) 10Ahmon Dancy: [C: 03+1] zuul: remove mode/umask from config git clone [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [14:29:53] fingers crossed [14:31:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:31:48] (03CR) 10Ahmon Dancy: [C: 03+1] contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [14:32:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:32:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T336886)', diff saved to https://phabricator.wikimedia.org/P49108 and previous config saved to /var/cache/conftool/dbconfig/20230607-143215-ladsgroup.json [14:32:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:32:20] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:32:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49109 and previous config saved to /var/cache/conftool/dbconfig/20230607-143235-ladsgroup.json [14:32:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T336886)', diff saved to https://phabricator.wikimedia.org/P49110 and previous config saved to 
/var/cache/conftool/dbconfig/20230607-143256-ladsgroup.json [14:33:00] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:33:09] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on A:cp-text_eqiad and A:cp [14:33:10] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:33:52] Amir1: deployed on prod [14:34:03] Let's see how that goes [14:34:11] 🍿 [14:35:32] went from 4 min to 20 seconds [14:35:50] o_O not sure what’s happening with the graphs now [14:35:52] two eqiads? [14:36:03] (03PS1) 10Ssingh: lvs2010: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/928067 (https://phabricator.wikimedia.org/T335777) [14:36:05] Lucas_WMDE: yeah, job name changed [14:36:10] ah [14:36:17] from low-traffic-jobs-blabla [14:36:19] to just blabla [14:36:28] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:36:33] (yes it doesn't want to get copy/pasted) [14:36:57] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs2010 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/928068 (https://phabricator.wikimedia.org/T335777) [14:37:38] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver1001.eqiad.wmnet with reason: host reimage [14:38:04] old line is still getting new data points and going up… do those old jobs all need to vanish first? 
[14:38:15] (in which case it should happen within 5 minutes I guess) [14:39:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs2010.codfw.wmnet [14:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49111 and previous config saved to /var/cache/conftool/dbconfig/20230607-143907-ladsgroup.json [14:39:11] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:39:18] (03CR) 10Jbond: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [14:39:27] (03CR) 10Jbond: P:cloudceph::osd: drop the profile::cloudceph::osd::hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927628 (owner: 10Jbond) [14:40:10] I think we need to bump it to forty https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&from=now-1h&to=now&viewPanel=2 [14:40:42] maybe, let's wait first [14:40:47] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&from=now-1h&to=now&viewPanel=74 [14:40:48] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.run-puppet-restart-varnish (exit_code=0) rolling custom on A:cp-upload_eqiad and A:cp [14:40:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver1001.eqiad.wmnet with reason: host reimage [14:41:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:45] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [14:44:37] (03PS4) 10Muehlenhoff: Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 
(https://phabricator.wikimedia.org/T338008) [14:44:46] I wonder why we're getting a worse job processing rate [14:45:13] (03CR) 10CI reject: [V: 04-1] Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:45:23] (03CR) 10Robertsky: change wikimaniawiki logo to 2023 version. T337044 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372 (owner: 10Robertsky) [14:45:32] (03PS1) 10Ladsgroup: changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 45 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928069 (https://phabricator.wikimedia.org/T320534) [14:46:04] claime: ^ I need to be afk for a bit but honestly, a couple of minutes of backlog is not that big of a deal. As long as it doesn't grow to an hour [14:46:07] akosiaris: do you think we may need more jobrunners ? They're getting a little crunched https://grafana.wikimedia.org/goto/oFBZjtlVz?orgId=1 [14:46:21] Not extremely but we're riding the limit [14:47:02] (03PS5) 10Muehlenhoff: Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) [14:47:35] Amir1: Do you want to bump it now, or wait and see? [14:48:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P49112 and previous config saved to /var/cache/conftool/dbconfig/20230607-144803-ladsgroup.json [14:48:07] wait for now but since I'll be afk feel free to merge and deploy if it reaches let's say half an hour for median [14:48:29] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:48:54] claime: not sure what you see that I should ? 
[14:48:57] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10cmooney) @KCVelaga_WMF can you post the full command you are running and output? I can successfully execute it on both those servers if I spawn a... [14:49:21] 800 idle works still ? [14:49:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs2010.codfw.wmnet [14:49:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:49:51] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs2010.codfw.wmnet` - lvs2010.codfw.wmnet (**WARN**) - Downtimed ho... 
[14:49:51] !log jbond@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:50:17] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [14:50:25] akosiaris: It's more the bumps when we restarted changeprop-jobqueue that worried me [14:50:32] But the load seems to be leveling off [14:51:13] (03CR) 10Ssingh: [C: 03+2] lvs2010: decommission host for codfw hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/928067 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:51:29] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:32] I'm gonna be in a meeting in ~10 minutes, shouldn't last too long, I'll try to keep an eye on the graphs [14:51:39] moritzm: going to merge the snakeoil :) [14:52:14] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs2010 from lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/928068 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [14:52:28] sukhe: oh, yes. please do :-) [14:52:37] Lucas_WMDE: The old job line has stopped getting updated so yeah, I think it was just a matter of the old jobs finishing [14:52:42] got distracted by one more PCC followup [14:52:52] ok [14:53:01] but the new one is now growing faster than the old one was :S [14:53:38] yeah, concurrency and processing rates have gone down a bit when we upped the concurrency [14:53:46] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin1001" [14:53:47] I'm having trouble understanding why though [14:54:01] !log installing postgresql 13 security updates (clients/libs, server instances all updated already) [14:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P49114 and previous config saved 
to /var/cache/conftool/dbconfig/20230607-145413-ladsgroup.json [14:54:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:47] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin1001" [14:54:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver1001.eqiad.wmnet with OS bookworm [14:54:58] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host puppetserver1001.eqiad.wmnet with OS bookworm completed: - puppetserver1001... [14:55:32] Lucas_WMDE: job run duration is going down though, so hopefully it levels off [14:56:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:35] !log homer "cr*-codfw*" commit "Gerrit: 928068 remove decommissioned host lvs2010" [14:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:44] true [14:58:11] (03CR) 10Lucas Werkmeister (WMDE): Enable cache warming jobs for parsoid per default. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927758 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [15:00:08] (03PS1) 10Jbond: 10.arpa: drop unused arpa space [dns] - 10https://gerrit.wikimedia.org/r/928072 [15:00:48] 10SRE, 10SRE-Access-Requests: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10cmooney) [15:02:33] !log de-pooling sessionstore/codfw — T337426 [15:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:36] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [15:02:40] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: maintenance [15:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P49115 and previous config saved to /var/cache/conftool/dbconfig/20230607-150309-ladsgroup.json [15:03:44] (03CR) 10Ssingh: [C: 03+1] "Thanks for the quick patch!" 
[dns] - 10https://gerrit.wikimedia.org/r/928072 (owner: 10Jbond) [15:04:33] (03CR) 10Jbond: [C: 03+2] "cheers" [dns] - 10https://gerrit.wikimedia.org/r/928072 (owner: 10Jbond) [15:06:55] !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetserver2001.mgmt.codfw.wmnet on all recursors [15:06:58] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetserver2001.mgmt.codfw.wmnet on all recursors [15:07:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: maintenance [15:08:09] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2001.codfw.wmnet with OS bookworm [15:08:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm [15:08:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [15:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P49116 and previous config saved to /var/cache/conftool/dbconfig/20230607-150919-ladsgroup.json [15:09:21] (03PS1) 10Elukey: Revert "Revert "varnishkafka: add catch all systemd unit"" [puppet] - 10https://gerrit.wikimedia.org/r/928087 [15:09:48] (03PS1) 10Ilias Sarantopoulos: ml-services: add gpu support for bloom-560m model [deployment-charts] - 10https://gerrit.wikimedia.org/r/928076 (https://phabricator.wikimedia.org/T333861) [15:10:01] !log disable puppet on all caching nodes to rollout a varnishakfka change (ref: https://gerrit.wikimedia.org/r/c/operations/puppet/+/928087) [15:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:03] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 
10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [15:10:10] jouncebot: nowandnext [15:10:10] No deployments scheduled for the next 1 hour(s) and 49 minute(s) [15:10:10] In 1 hour(s) and 49 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1700) [15:10:20] (03CR) 10Esanders: [C: 04-1] "Don't merge until Ia8a7663f is deployed. After that this becomes a no-op." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) (owner: 10Esanders) [15:10:49] (03CR) 10Elukey: [C: 03+2] ml-services: add gpu support for bloom-560m model [deployment-charts] - 10https://gerrit.wikimedia.org/r/928076 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [15:10:51] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/927758 might still need a revert sooner or later [15:10:51] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore2001 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926588 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:11:02] (I’m still considering myself responsible for the backport+config window) [15:11:06] but not doing anything right now [15:11:52] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:16] (03CR) 10Elukey: [C: 03+2] Revert "Revert "varnishkafka: add catch all systemd unit"" [puppet] - 10https://gerrit.wikimedia.org/r/928087 (owner: 10Elukey) [15:14:15] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:14:44] (03PS1) 10Effie Mouzeli: ipoid: chart fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/928079 [15:14:45] !log Upgrading Cassandra to 4.1.1, sessionstore2001 — T337426 [15:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:49] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [15:16:30] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: chart fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/928079 (owner: 10Effie Mouzeli) [15:17:10] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney all those connections are no longer on the old switch we can delete those. thanks [15:17:21] Lucas_WMDE: I'm not a fan of that insertion/processing ratio [15:17:22] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver2001.codfw.wmnet with OS bookworm [15:17:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm executed with errors: - puppet... 
[15:17:33] 128/75 looks like we're going to keep backlogging [15:17:39] (03Merged) 10jenkins-bot: ipoid: chart fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/928079 (owner: 10Effie Mouzeli) [15:17:43] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2001.codfw.wmnet with OS bookworm [15:17:46] I would revert by 16:00 UTC if it doesn’t get better [15:17:49] but also happy to revert sooner [15:17:51] !log installing libwebp security updates on buster [15:17:52] (the config change, that is) [15:17:53] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm [15:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:58] Let me try to bump up concurrency again [15:18:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T336886)', diff saved to https://phabricator.wikimedia.org/P49117 and previous config saved to /var/cache/conftool/dbconfig/20230607-151815-ladsgroup.json [15:18:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:18:18] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:18:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:18:32] Lucas_WMDE: no problem, just checking it cause elukey is currently performing some work on the CDN [15:18:35] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 45 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928069 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [15:18:36] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T336886)', diff saved to https://phabricator.wikimedia.org/P49118 and previous config saved to /var/cache/conftool/dbconfig/20230607-151835-ladsgroup.json [15:18:42] ok [15:18:42] (03PS5) 10Effie Mouzeli: ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) [15:18:55] Lucas_WMDE: but it shouldn't have any kind of impact on the ability to deploy mw stuff [15:19:00] sounds good, thanks! [15:19:05] (03CR) 10Clément Goubert: [C: 03+2] changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 45 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928069 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [15:19:07] [last famous words] [15:19:09] (03PS6) 10Effie Mouzeli: ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) [15:19:17] hehehe [15:19:57] (03Merged) 10jenkins-bot: changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 45 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928069 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [15:20:36] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 (https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [15:21:02] !log Bumping prewarmparsoid concurrency to 45 in changeprop-jobqueue - T320534 [15:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:05] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [15:21:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:21:23] (03Merged) 10jenkins-bot: ipoid: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/921707 
(https://phabricator.wikimedia.org/T336163) (owner: 10Effie Mouzeli) [15:21:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:22:02] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:22:32] !log re-enable puppet on caching nodes [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:22:57] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:23:00] Amir1: fyi, concurrency bumped to 45 [15:23:09] !log all varnishkafka instances on caching nodes are getting restarted due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/928087 - T337825 [15:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:12] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 [15:23:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T336886)', diff saved to https://phabricator.wikimedia.org/P49119 and previous config saved to /var/cache/conftool/dbconfig/20230607-152333-ladsgroup.json [15:23:36] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:24:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49120 and previous config saved to /var/cache/conftool/dbconfig/20230607-152425-ladsgroup.json [15:24:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:24:46] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Traffic, 10Patch-For-Review: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (10elukey) The new `varnishkafka-all` unit is being rolled out across all cp nodes. 
Next steps: * Merge https://gerrit.wikimedia.org/r/924507 (no-op, just... [15:24:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:24:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T336886)', diff saved to https://phabricator.wikimedia.org/P49121 and previous config saved to /var/cache/conftool/dbconfig/20230607-152456-ladsgroup.json [15:25:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Papaul) @Volans on lsw1-a1 which is a new switch, after running the cookbook it did PASS . However no configuration was done on the switch itsel... [15:26:30] (03PS1) 10AOkoth: vrts: post script cleanup & export variables [puppet] - 10https://gerrit.wikimedia.org/r/928084 (https://phabricator.wikimedia.org/T330920) [15:26:40] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:26:41] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:26:59] !log rolling restart of FPM on mw canaries to pick up libwebp security updates [15:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:00] Lucas_WMDE: job processing rate looking better [15:27:09] * Lucas_WMDE looks [15:27:27] https://grafana.wikimedia.org/goto/imsUetl4z?orgId=1 [15:27:49] nice [15:27:54] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @papaul thanks I'll remove them from netbox cheers. 
[15:28:18] I think it'll take a bit of time to trim the backlog down, but it should be fine (hopefully) [15:29:29] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Jhancock.wm) Cable IDs em1 - 11995 em2 - 11997 nic2 port 1 - 11996 nic2 port 2 - 11998 [15:29:33] queue wait time is growing a *lot* slower, at least [15:29:45] (03PS1) 10Ilias Sarantopoulos: ml-services: debug HIP for AMD GPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/928085 (https://phabricator.wikimedia.org/T333861) [15:29:55] Lucas_WMDE: yeah we're about even on insertion/processing [15:29:59] 120/112 [15:30:11] (ops/s) [15:32:00] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (10ssingh) 05Open→03Resolved a:03ssingh I am going to mark this as resolved as lvs2013 didn't have this issue. Thanks again for the help and... [15:32:18] Lucas_WMDE: We've come on the right side of the insertion/processing ratio [15:32:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T336886)', diff saved to https://phabricator.wikimedia.org/P49122 and previous config saved to /var/cache/conftool/dbconfig/20230607-153221-ladsgroup.json [15:32:24] yay [15:32:26] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:32:27] thanks claime! 
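(Annotation: a rough sketch of the arithmetic behind the insertion/processing figures quoted above, 128/75 and 120/112 ops/s. The function names are hypothetical, and treating each pair of rates as constant over a minute is an assumption; in practice both rates move with edit volume and jobrunner load.)

```python
def net_backlog_rate(insert_rate: float, process_rate: float) -> float:
    """Jobs per second added to the queue (positive) or drained from it (negative)."""
    return insert_rate - process_rate


def backlog_change(insert_rate: float, process_rate: float, seconds: float) -> float:
    """Change in queued jobs over `seconds`, assuming both rates stay constant."""
    return net_backlog_rate(insert_rate, process_rate) * seconds


# Rates quoted in the channel, in ops/s, as (insertion, processing).
samples = {"15:17": (128, 75), "15:29": (120, 112)}

for when, (inserted, processed) in samples.items():
    per_minute = backlog_change(inserted, processed, 60)
    print(f"{when}: backlog changes by {per_minute:+.0f} jobs per minute")
```

(A negative result would mean the queue is draining, which is the state described as being "on the right side of the insertion/processing ratio".)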
[15:32:35] We're now processing faster than jobs are being inserted, so backlog should start reducing [15:33:26] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:33:54] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [15:34:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:35:07] (03CR) 10Matthias Mullie: "Gentle reminder :) (no rush, it won't run until next Wed)" [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [15:37:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) It didn't complete successfully, it failed to check the uptime of the switch and asked the operator what to do, and when it was answered... [15:37:39] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2001.codfw.wmnet with reason: host reimage [15:38:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P49123 and previous config saved to /var/cache/conftool/dbconfig/20230607-153839-ladsgroup.json [15:38:42] PROBLEM - Check systemd state on cp5026 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:39:07] !log installing isc-dhcp bugfixes updates from Bullseye 11.7 point release [15:39:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:26] (03PS1) 10Dzahn: admin: remove contint-roots from releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/928108 [15:39:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:40:58] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2001.codfw.wmnet with reason: host reimage [15:41:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entry for lvs2014 - pt1979@cumin2002" [15:41:29] RECOVERY - Check systemd state on cp5026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:35] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore2002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926589 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:42:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entry for lvs2014 - pt1979@cumin2002" [15:42:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lvs2014.mgmt.codfw.wmnet with reboot policy FORCED [15:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[15:44:07] (03PS1) 10Effie Mouzeli: ipoid: fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928110 [15:44:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:44:43] !log Upgrading Cassandra to 4.1.1, sessionstore2002 — T337426 [15:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:46] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [15:44:51] (03CR) 10CI reject: [V: 04-1] ipoid: fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928110 (owner: 10Effie Mouzeli) [15:47:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [15:47:10] claime: looks like the insertion and processing rate are neck and neck, right now inserting is a tad higher than processing [15:47:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P49124 and previous config saved to /var/cache/conftool/dbconfig/20230607-154727-ladsgroup.json [15:47:32] the backlog hasn’t gone up much but neither has it gone down [15:48:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:48:40] (03PS2) 10Effie Mouzeli: ipoid: fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928110 [15:50:01] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore2003 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/926590 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:50:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
testvm2005.codfw.wmnet [15:51:07] (03PS3) 10Effie Mouzeli: ipoid: fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928110 [15:51:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists1003.wikimedia.org [15:52:07] !log Upgrading Cassandra to 4.1.1, sessionstore2003 — T337426 [15:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:10] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [15:52:11] Lucas_WMDE: jobrunners are ok, I wonder if I should up concurrency even more [15:52:23] (03PS1) 10Btullis: Use standard uppercase for cumin alias P selector [puppet] - 10https://gerrit.wikimedia.org/r/928111 [15:52:29] Amir1, akosiaris: thoughts ^ ? [15:52:51] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) >>! In T324659#8904634, @Dzahn wrote: > @hashar This new machine is on buster. Somehow I thought we did bullseye from t... [15:52:59] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to analytics-product-users for KCVelaga (WMF) - https://phabricator.wikimedia.org/T337766 (10KCVelaga_WMF) @cmooney sure. It is actually the full command and all I output I got. Attaching a screenshot for your reference. {F37096898} [15:53:00] claime: tomorrow you mean, right ? 
[15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P49125 and previous config saved to /var/cache/conftool/dbconfig/20230607-155345-ladsgroup.json [15:53:54] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928110 (owner: 10Effie Mouzeli) [15:53:58] (03CR) 10Btullis: [C: 03+2] Update the cumin aliases for the wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928021 (https://phabricator.wikimedia.org/T315426) (owner: 10Btullis) [15:54:01] akosiaris: if 18 minutes of backlog until tomorrow is acceptable it can wait until tomorrow [15:54:46] (03Merged) 10jenkins-bot: ipoid: fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928110 (owner: 10Effie Mouzeli) [15:54:56] claime: we can go ahead and put some servers to the problem too [15:55:08] I mean we have a pool [15:55:10] claime: which job though ? 
[15:55:25] akosiaris: parsoidCachePrewarm [15:55:38] oh, that's new, we probably are ok with 18m until tomorrow [15:56:03] claime: it depends on what is happening in the wikis too, so indeed nothing to worry about yet [15:56:33] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:56:36] a'ight [15:56:51] !log Beginning (3 hour) generated traffic testing of sessionstore.svc.codfw.wmnet — T337426 [15:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:12] hm, Grafana suddenly shows me “no data” o_O [15:57:41] (03PS1) 10Ssingh: lvs2014: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/928112 (https://phabricator.wikimedia.org/T326767) [15:57:46] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:32] (03PS1) 10Ssingh: sites.yaml: add new LVS host lvs2014 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/928113 (https://phabricator.wikimedia.org/T326767) [15:58:45] (nevermind. apparently my grafana-rw session had just expired, reloading the tab fixed it) [16:00:43] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lists1003.wikimedia.org [16:02:16] Lucas_WMDE: I'm off, I've briefed US on-call [16:02:34] Lucas_WMDE: are you still planning to revert the cache warming patch? Or should I do it? Queue wait time is at 18 minutes now. 
[16:02:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P49126 and previous config saved to /var/cache/conftool/dbconfig/20230607-160234-ladsgroup.json [16:02:38] Thanks for following up on the deployment [16:02:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs2014.mgmt.codfw.wmnet with reboot policy FORCED [16:02:56] duesen: It's stabilized at around 18 minutes [16:03:00] Lucas_WMDE: I guess I should read the backlog first [16:03:06] duesen: yep :D [16:03:12] claime: yea, but that's way too high, right? [16:03:22] not planning to revert right now [16:03:32] (also about to be off but I can wait for you to read the backlog at least ^^) [16:04:52] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin2002" [16:05:05] the queue wait time is very slowly decreasing at the moment, fwiw [16:05:52] if we want it to decrease faster, maybe we could also disable the warmup just for one or two large wikis, rather than reverting the whole change [16:05:57] duesen: It's high-ish, but it's way less than yesterday's wait time that peaked around 1h30 [16:07:57] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:08:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T336886)', diff saved to https://phabricator.wikimedia.org/P49127 and previous config saved to /var/cache/conftool/dbconfig/20230607-160851-ladsgroup.json [16:08:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:08:55] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:09:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db1144.eqiad.wmnet with reason: Maintenance [16:09:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49128 and previous config saved to /var/cache/conftool/dbconfig/20230607-160912-ladsgroup.json [16:11:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [16:11:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:12:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:12:21] claime, Amir1: is the concurrency that you are changing in config the same concurrency that is reported on grafana? because that's hovering around 15... So whatever you are configuring doesn't seem to have an impact on actual concurrency... [16:12:27] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:12:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:13:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:13:23] claime: where can I see the processing rate? [16:14:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49129 and previous config saved to /var/cache/conftool/dbconfig/20230607-161416-ladsgroup.json [16:14:19] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:14:39] akosiaris: insert rate will go up when the US wakes up. it's proportional to the number of edits. 
[16:15:04] duesen: https://grafana.wikimedia.org/goto/trp2Rh_Vz?orgId=1 [16:15:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3050.esams.wmnet [16:16:07] duesen: Mostly what we've done is given it its own lane in changeprop-jobqueue and tweaking that concurrency [16:16:36] ok, I'll leave it with you, youall know what you are doing :) [16:16:51] But that was on Amir1's suggestion, I don't know enough about changeprop-jobqueue's internals to be sure how it relates to what grafana is showing [16:17:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T336886)', diff saved to https://phabricator.wikimedia.org/P49130 and previous config saved to /var/cache/conftool/dbconfig/20230607-161740-ladsgroup.json [16:17:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:17:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:18:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T336886)', diff saved to https://phabricator.wikimedia.org/P49131 and previous config saved to /var/cache/conftool/dbconfig/20230607-161800-ladsgroup.json [16:19:13] (03CR) 10Cwhite: [C: 03+2] opensearch: clean up hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/927769 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [16:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T336886)', diff saved to https://phabricator.wikimedia.org/P49132 and previous config saved to /var/cache/conftool/dbconfig/20230607-162012-ladsgroup.json [16:20:15] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:20:38] (03PS1) 10Jameel Kaisar: GeoIP experiments: Stop Network Probes [puppet] - 10https://gerrit.wikimedia.org/r/928116 
(https://phabricator.wikimedia.org/T332024) [16:21:24] (03CR) 10Hashar: [C: 04-1] "I have dig in to the Exec { umask => xxx } earlier today cause I wanted to remove it entirely. I think it is required afterall in order to" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [16:21:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lvs2014'] [16:23:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:23:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:23:52] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3050.esams.wmnet [16:25:43] (03PS1) 10Jameel Kaisar: GeoIP experiments: Stop NEL Success Reports [puppet] - 10https://gerrit.wikimedia.org/r/928117 (https://phabricator.wikimedia.org/T332024) [16:27:09] (03CR) 10Jameel Kaisar: "We can also remove the if-else block altogether." [puppet] - 10https://gerrit.wikimedia.org/r/928117 (https://phabricator.wikimedia.org/T332024) (owner: 10Jameel Kaisar) [16:29:08] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:29:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P49133 and previous config saved to /var/cache/conftool/dbconfig/20230607-162922-ladsgroup.json [16:30:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [16:30:04] effie, claime, Amir1: I'm not sure I understand the metrics. 
processing and enqueue rate on eqiad seem to be roughly the same, but we have another 80 jobs/sec coming in from codfw. They also need to be processed at eqiad, right? [16:30:09] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [16:32:10] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928117 (https://phabricator.wikimedia.org/T332024) (owner: 10Jameel Kaisar) [16:32:24] I can't find a way to stack the graphs for the processing rate, since eqiad and codfw have different data sources... [16:32:34] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928116 (https://phabricator.wikimedia.org/T332024) (owner: 10Jameel Kaisar) [16:34:39] yeah that dash needs some love [16:34:59] And yes, they're processed by eqiad jobrunners iiuc [16:35:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P49134 and previous config saved to /var/cache/conftool/dbconfig/20230607-163518-ladsgroup.json [16:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P49135 and previous config saved to /var/cache/conftool/dbconfig/20230607-164428-ladsgroup.json [16:45:35] queue wait time is now going up again :( [16:46:00] but I’m now off – if a revert is necessary, I’m sure someone else can deploy it [16:46:08] best of luck!
[16:50:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P49137 and previous config saved to /var/cache/conftool/dbconfig/20230607-165024-ladsgroup.json [16:50:39] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:52:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:52:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:52:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:52:45] Amir1, claime: the queue is at 20minutes now, it's not going down... [16:52:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:53:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:53:23] PROBLEM - Disk space on ms-be1040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [16:55:02] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2014'] [16:55:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2014'] [16:58:00] duesen: I'll bump it even further [16:58:44] duesen: quick q: Did you enable it for wikidata or commons? 
[16:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49138 and previous config saved to /var/cache/conftool/dbconfig/20230607-165934-ladsgroup.json [16:59:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [16:59:39] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1700) [17:01:31] Amir1: According to the patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/927758/ no [17:02:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:02:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:02:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49139 and previous config saved to /var/cache/conftool/dbconfig/20230607-170252-ladsgroup.json [17:03:06] claime: good. Let's bump it to 60 [17:03:37] Amir1: If you get some time to explain to me (not necessarily today) how concurrency works in changeprop-jobqueue I'd take it [17:03:53] Prepare the patch I'll deploy? 
[17:04:23] let me see if there is a doc [17:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T336886)', diff saved to https://phabricator.wikimedia.org/P49140 and previous config saved to /var/cache/conftool/dbconfig/20230607-170530-ladsgroup.json [17:05:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:05:34] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:05:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:05:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T336886)', diff saved to https://phabricator.wikimedia.org/P49141 and previous config saved to /var/cache/conftool/dbconfig/20230607-170551-ladsgroup.json [17:06:03] https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue [17:06:14] very detailed explanation of what concurrency means here [17:07:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2014.codfw.wmnet with OS bullseye [17:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49142 and previous config saved to /var/cache/conftool/dbconfig/20230607-170758-ladsgroup.json [17:08:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye [17:08:08] (03PS1) 10Ladsgroup: changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928120 (https://phabricator.wikimedia.org/T320534) [17:08:09] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1197 (T336886)', diff saved to https://phabricator.wikimedia.org/P49143 and previous config saved to /var/cache/conftool/dbconfig/20230607-170808-ladsgroup.json [17:08:25] claime: ^ patch up [17:08:42] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928120 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [17:10:26] Amir1: +1'd, waiting for your +2 [17:10:38] (03CR) 10Ladsgroup: [C: 03+2] changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928120 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [17:11:11] done sorry [17:11:25] (03Merged) 10jenkins-bot: changeprop-jobqueue: Bump the concurrency for prewarmparsoid to 60 [deployment-charts] - 10https://gerrit.wikimedia.org/r/928120 (https://phabricator.wikimedia.org/T320534) (owner: 10Ladsgroup) [17:11:56] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:12:00] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:12:26] (03PS1) 10Cathal Mooney: Add rule to allow TFTP to install server to support Juniper ZTP [homer/public] - 10https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) [17:12:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:12:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:13:03] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [17:13:42] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [17:14:01] Let's see how it goes [17:14:18] (03PS1) 10Herron: udp2log: add 6to4 relay [puppet] - 10https://gerrit.wikimedia.org/r/928122 
(https://phabricator.wikimedia.org/T338127) [17:17:09] per-pod concurrency looks up to ~22 from 16/17 before [17:18:36] (03PS2) 10Herron: udp2log: add 6to4 relay [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) [17:19:51] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41604/console" [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [17:20:57] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: vo-escalate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:29] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P49144 and previous config saved to /var/cache/conftool/dbconfig/20230607-172304-ladsgroup.json [17:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P49145 and previous config saved to /var/cache/conftool/dbconfig/20230607-172315-ladsgroup.json [17:25:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T338347 (10phaultfinder) [17:26:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [17:26:27] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... 
[17:30:38] PROBLEM - Host db1135 #page is DOWN: PING CRITICAL - Packet loss = 100% [17:30:57] cwhite: ^ I'm up next in the meeting, can you take that? [17:31:59] on it [17:33:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3051.esams.wmnet [17:34:54] !log cwhite@cumin2002 dbctl commit (dc=all): 'depool db1135', diff saved to https://phabricator.wikimedia.org/P49146 and previous config saved to /var/cache/conftool/dbconfig/20230607-173453-cwhite.json [17:35:07] I'm around [17:35:19] backlog for prewarmparsoid looks to be slowly going down [17:35:59] let me check [17:36:05] Amir1: I depooled db1135 per runbook. Are you going to have a look at it? [17:36:12] yeah [17:36:18] ack, thanks! [17:37:07] Interestingly, VO appears to be down right now. [17:37:39] I can't ssh to db1135 [17:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P49147 and previous config saved to /var/cache/conftool/dbconfig/20230607-173810-ladsgroup.json [17:38:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P49148 and previous config saved to /var/cache/conftool/dbconfig/20230607-173821-ladsgroup.json [17:39:16] cwhite Amir1: thanks <3 [17:42:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3051.esams.wmnet [17:43:29] VO is experiencing deployment problems. Can't ack the page until it comes back. [17:43:32] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [17:43:47] Hopefully it isn't too noisy :( [17:44:12] T338354 [17:44:13] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [17:44:20] <_joe_> can someone ack the alert? 
[17:44:26] <_joe_> !incidents [17:44:32] Could not fetch teams from the api, sorry [17:44:32] could not find the team [17:44:35] _joe_: VO is down :P [17:44:40] _joe_: see my comment from earlier, VO is unavailable [17:44:46] <_joe_> sigh [17:44:55] _joe_: https://victorops.statuspage.io/ [17:45:02] <_joe_> so my phone will keep ringing for some time I guess [17:45:04] I could ack it from the app [17:45:10] at least it tells me [17:45:11] I can ACK via SMS [17:45:15] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: vo-escalate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:24] ACK != resolved [17:45:28] Amir1 mutante if you can ack it, please do! [17:45:35] did already [17:45:36] done [17:45:38] should I resolve it? [17:45:53] nothing too problematic, right? [17:46:02] thanks! [17:46:15] acked by sending 95238 via SMS [17:46:31] I see no mw errors, so no [17:46:43] !log bking@wdqs depool wdqs2012 T321605 [17:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:46] T321605: Make WCQS/WDQS data transfer cookbook more reliable - https://phabricator.wikimedia.org/T321605 [17:47:23] depooled already and mw sorta depooled it automatically anyway, so user-facing impact is limited [17:47:48] but SEL is interesting, does it mean both memory and CPU are broken? [17:48:00] I'll ack it on icinga so the page can be resolved. 
[17:48:15] who knows, tomorrow I can give it a second look [17:48:25] https://usercontent.irccloud-cdn.com/file/Wj4MGJDI/image.png [17:48:39] jynus: sel is this T338354 [17:48:53] it could also be a board issue [17:49:09] it can wait :-D [17:49:13] ACKNOWLEDGEMENT - SSH on db1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds cole_white T338354 https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:49:14] ACKNOWLEDGEMENT - Host db1135 #page is DOWN: PING CRITICAL - Packet loss = 100% cole_white T338354 [17:49:50] page can be resolved in VO now [17:50:27] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet,service=cdn [17:50:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet,service=ats-be [17:50:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet,service=cdn [17:50:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet,service=ats-be [17:50:35] topranks: 90s called, they want their screens back [17:50:41] ssh and ping is almost always duplicate check btw..probably never had just one of them by itself [17:51:19] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:21] sent "resolve" in VO [17:51:29] thanks! [17:51:31] Amir1: haha you gotta love that old school vibe :) [17:51:59] 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Ladsgroup) [17:52:27] predicts.. 
they are going to move the RAM to another slot and then we reboot and it's like it never happened [17:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49149 and previous config saved to /var/cache/conftool/dbconfig/20230607-175316-ladsgroup.json [17:53:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1147.eqiad.wmnet with reason: Maintenance [17:53:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:53:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T336886)', diff saved to https://phabricator.wikimedia.org/P49150 and previous config saved to /var/cache/conftool/dbconfig/20230607-175327-ladsgroup.json [17:53:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [17:53:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1147.eqiad.wmnet with reason: Maintenance [17:53:36] mutante: bonus point if they blow on the contacts in the process [17:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T336886)', diff saved to https://phabricator.wikimedia.org/P49151 and previous config saved to /var/cache/conftool/dbconfig/20230607-175337-ladsgroup.json [17:53:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [17:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T336886)', diff saved to https://phabricator.wikimedia.org/P49152 and previous config saved to /var/cache/conftool/dbconfig/20230607-175347-ladsgroup.json [17:53:51] topranks: only after we upgrade firmware :) [17:54:25] (03PS1) 10Ladsgroup: db1135: Disable notification [puppet] - 
10https://gerrit.wikimedia.org/r/928127 (https://phabricator.wikimedia.org/T338354) [17:55:16] (03CR) 10Ladsgroup: [C: 03+2] db1135: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/928127 (https://phabricator.wikimedia.org/T338354) (owner: 10Ladsgroup) [17:57:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: vo-escalate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T336886)', diff saved to https://phabricator.wikimedia.org/P49153 and previous config saved to /var/cache/conftool/dbconfig/20230607-175833-ladsgroup.json [17:58:36] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [18:00:05] jeena and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1800). [18:00:05] jeena and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T1800). 
[18:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T336886)', diff saved to https://phabricator.wikimedia.org/P49154 and previous config saved to /var/cache/conftool/dbconfig/20230607-180154-ladsgroup.json [18:02:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2014.codfw.wmnet with OS bullseye [18:04:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye executed... [18:06:39] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928129 (https://phabricator.wikimedia.org/T337526) [18:06:41] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928129 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [18:07:20] claime: FWIW, we also have the issue that queuing jobs is quite slow now :/ [18:07:22] T338357 [18:07:22] T338357: Pushing jobs to jobqueue is slow again - https://phabricator.wikimedia.org/T338357 [18:07:29] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928129 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [18:07:33] *sighs* [18:07:34] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@d90d5c8]: (no justification provided) [18:08:06] I don't know how to fix that unfortunately [18:08:08] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@d90d5c8]: (no justification provided) (duration: 00m 33s) 
[18:13:07] (03CR) 10Ladsgroup: [ImageSuggestions] Process suggestions via job queue rather than sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [18:13:27] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P49155 and previous config saved to /var/cache/conftool/dbconfig/20230607-181339-ladsgroup.json [18:14:26] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.12 refs T337526 [18:14:33] T337526: 1.41.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T337526 [18:16:35] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P49156 and previous config saved to /var/cache/conftool/dbconfig/20230607-181700-ladsgroup.json [18:20:32] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.12 refs T337526 (duration: 06m 05s) [18:20:35] T337526: 1.41.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T337526 [18:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:22:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3052.esams.wmnet [18:26:34] (KubernetesAPILatency) resolved: High 
Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:27:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1135.eqiad.wmnet with reason: T338354 [18:27:08] T338354: db1135 has crashed - https://phabricator.wikimedia.org/T338354 [18:27:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1135.eqiad.wmnet with reason: T338354 [18:28:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P49157 and previous config saved to /var/cache/conftool/dbconfig/20230607-182845-ladsgroup.json [18:31:01] 10Puppet, 10Cloud-VPS, 10cloud-services-team: puppet package versioning on Bookworm for cloud-vps - https://phabricator.wikimedia.org/T338195 (10MoritzMuehlenhoff) >>! In T338195#8906984, @Andrew wrote: >> >> I think this is a missing dependency in the package. > > Indeed, installing 'ruby-sorted-set' fix... [18:32:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3052.esams.wmnet [18:32:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P49158 and previous config saved to /var/cache/conftool/dbconfig/20230607-183206-ladsgroup.json [18:41:26] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [18:41:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [18:43:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T336886)', diff saved to https://phabricator.wikimedia.org/P49159 and previous config saved to /var/cache/conftool/dbconfig/20230607-184351-ladsgroup.json [18:43:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1148.eqiad.wmnet with reason: Maintenance [18:43:55] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [18:44:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1148.eqiad.wmnet with reason: Maintenance [18:44:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T336886)', diff saved to https://phabricator.wikimedia.org/P49160 and previous config saved to /var/cache/conftool/dbconfig/20230607-184411-ladsgroup.json [18:46:49] RECOVERY - Host an-worker1125 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [18:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T336886)', diff saved to https://phabricator.wikimedia.org/P49161 and previous config saved to /var/cache/conftool/dbconfig/20230607-184712-ladsgroup.json [18:47:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:47:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:48:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T336886)', diff saved to https://phabricator.wikimedia.org/P49162 and previous config saved to /var/cache/conftool/dbconfig/20230607-184808-ladsgroup.json [18:52:56] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:53:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:57:48] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10Jclark-ctr) This server is out of warranty [18:58:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [18:58:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [18:59:43] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Papaul) @Jclark-ctr i took a quick look at dbproxy1022 the server is connected using the second NIC and not the first NIC that is the reason it is not pxe bootin... [18:59:49] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy1022.eqiad.wmnet with OS bullseye [18:59:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... 
[19:02:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [19:03:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P49163 and previous config saved to /var/cache/conftool/dbconfig/20230607-190314-ladsgroup.json [19:04:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [19:05:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [19:05:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T336886)', diff saved to https://phabricator.wikimedia.org/P49164 and previous config saved to /var/cache/conftool/dbconfig/20230607-190514-ladsgroup.json [19:05:18] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:07:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T336886)', diff saved to https://phabricator.wikimedia.org/P49165 and previous config saved to /var/cache/conftool/dbconfig/20230607-190737-ladsgroup.json [19:09:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2001.codfw.wmnet [19:10:03] (03PS4) 10Ahmon Dancy: git::clone: Ensure that the URL for origin is always up-to-date [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [19:11:33] !log (Re)pooling codfw sessionstore — T337426 [19:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:36] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [19:11:39] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: maintenance [19:16:42] !log eevans@cumin1001 END (PASS) - Cookbook 
sre.discovery.service-route (exit_code=0) pool sessionstore in codfw: maintenance [19:18:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P49166 and previous config saved to /var/cache/conftool/dbconfig/20230607-191820-ladsgroup.json [19:21:43] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:22:07] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:32] ^^ you can ignore that one [19:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P49167 and previous config saved to /var/cache/conftool/dbconfig/20230607-192243-ladsgroup.json [19:22:47] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:23:04] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:23:18] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [19:25:31] (03CR) 10BCornwall: [C: 04-1] "Thanks for catching that! Comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [19:27:52] (03CR) 10Dzahn: [C: 03+1] vrts: post script cleanup & export variables [puppet] - 10https://gerrit.wikimedia.org/r/928084 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [19:28:34] (03CR) 10AOkoth: [C: 03+2] vrts: post script cleanup & export variables [puppet] - 10https://gerrit.wikimedia.org/r/928084 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [19:33:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T336886)', diff saved to https://phabricator.wikimedia.org/P49168 and previous config saved to /var/cache/conftool/dbconfig/20230607-193326-ladsgroup.json [19:33:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1149.eqiad.wmnet with reason: Maintenance [19:33:30] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:33:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1149.eqiad.wmnet with reason: Maintenance [19:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T336886)', diff saved to https://phabricator.wikimedia.org/P49169 and previous config saved to /var/cache/conftool/dbconfig/20230607-193357-ladsgroup.json [19:34:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:37] (03PS1) 10BBlack: traffic-pool: update After=services [puppet] - 10https://gerrit.wikimedia.org/r/928131 [19:37:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to 
https://phabricator.wikimedia.org/P49170 and previous config saved to /var/cache/conftool/dbconfig/20230607-193749-ladsgroup.json [19:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T336886)', diff saved to https://phabricator.wikimedia.org/P49171 and previous config saved to /var/cache/conftool/dbconfig/20230607-193850-ladsgroup.json [19:38:54] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:39:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:40:16] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: sync [19:40:29] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [19:40:51] !log cp*: disabling puppet temporarily out of an abundance of caution [19:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:57] (03CR) 10BBlack: [C: 03+2] traffic-pool: update After=services [puppet] - 10https://gerrit.wikimedia.org/r/928131 (owner: 10BBlack) [19:41:35] !log manually created 3 global accounts T338197 [19:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:38] T338197: 3 accounts need manual merging into global accounts - https://phabricator.wikimedia.org/T338197 [19:44:43] (03PS1) 10AOkoth: vrts: Fix issue in install script [puppet] - 10https://gerrit.wikimedia.org/r/928133 (https://phabricator.wikimedia.org/T330920) [19:45:50] (03CR) 10AOkoth: [C: 03+2] vrts: Fix issue in install script [puppet] - 10https://gerrit.wikimedia.org/r/928133 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [19:51:43] !log jclark@cumin1001 START - Cookbook 
sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [19:51:49] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [19:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T336886)', diff saved to https://phabricator.wikimedia.org/P49172 and previous config saved to /var/cache/conftool/dbconfig/20230607-195255-ladsgroup.json [19:52:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [19:52:59] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:53:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [19:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T336886)', diff saved to https://phabricator.wikimedia.org/P49173 and previous config saved to /var/cache/conftool/dbconfig/20230607-195316-ladsgroup.json [19:53:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P49174 and previous config saved to /var/cache/conftool/dbconfig/20230607-195356-ladsgroup.json [19:54:26] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [19:54:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [19:56:10] (03PS11) 10Jsn.sherman: beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) [19:58:45] (03CR) 10Clare Ming: beta: log additional click events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230607T2000). [20:00:05] JSherman, RoanKattouw, and eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:48] Guessing you're going to handle that RoanKattouw ? [20:01:15] Yes I will [20:01:19] ^^ [20:01:22] Greetings All! [20:01:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T336886)', diff saved to https://phabricator.wikimedia.org/P49175 and previous config saved to /var/cache/conftool/dbconfig/20230607-200134-ladsgroup.json [20:01:40] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:01:46] hi there [20:03:53] (03PS5) 10Catrope: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [20:03:57] (03CR) 10Catrope: [C: 03+2] Deploy GDI safety survey to JA and RU wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [20:04:16] Alright sorry about that delay, I needed some things up and running on my laptop [20:04:24] Going to do eigyan's patch first, then JSherman's, then mine [20:04:39] Thanks RoanKattouw [20:04:46] (03Merged) 10jenkins-bot: Deploy GDI safety survey to JA and RU wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927233 (https://phabricator.wikimedia.org/T337728) (owner: 10Eigyan) [20:07:44] !log catrope@deploy1002 Started scap: Backport for [[gerrit:927233|Deploy GDI safety survey to JA and RU wikis. (T337728)]] [20:07:48] T337728: Deploy Community Safety Survey on RU and JA Wikipedias - estimated week of June 5th 2023 - https://phabricator.wikimedia.org/T337728 [20:09:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P49176 and previous config saved to /var/cache/conftool/dbconfig/20230607-200902-ladsgroup.json [20:09:20] !log catrope@deploy1002 catrope and essexigyan: Backport for [[gerrit:927233|Deploy GDI safety survey to JA and RU wikis. (T337728)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:10:45] eigyan: Your change is on the test servers, please test [20:11:08] RoanKattouw checking now, thanks! 
[20:11:36] (03PS12) 10Catrope: beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman) [20:11:39] (03CR) 10Catrope: [C: 03+2] beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman) [20:11:56] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs2012.codfw.wmnet [20:11:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs2012.codfw.wmnet [20:12:03] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:12:41] Thank you RoanKattouw, I can see my changes were successfully deployed to test [20:12:48] (03Merged) 10jenkins-bot: beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman) [20:13:00] Great, rolling out to production now [20:13:07] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2012 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:13:08] (03PS1) 10AOkoth: vrts: use variables in rsyncquickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/928136 (https://phabricator.wikimedia.org/T330920) [20:13:27] JSherman: Yours is merged; since it's beta-only, it should appear there in approx 10-15 minutes (there's an automated deployment job that runs every 10 mins) [20:13:49] (03CR) 10Volans: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/928111 (owner: 10Btullis) [20:13:58] RoanKattouw: I'll have a look shortly. Thanks! 
[20:14:05] RoanKattouw: it's every hour now actually https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ [20:14:27] I think, let me double check [20:14:49] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: attempting WDQS stack on bullseye [20:14:50] That's the update-databases job [20:14:55] https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ runs every 10 mins [20:15:00] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ [20:15:03] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: attempting WDQS stack on bullseye [20:15:06] oh yeah [20:15:10] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: attempting WDQS stack on bullseye [20:15:13] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs1016.eqiad.wmnet with reason: attempting WDQS stack on bullseye [20:15:14] And that feeds into https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ which also runs every 10 mins [20:15:31] Oh oops https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/ is the relevant one for JSherman since it's a config patch [20:15:31] 10ops-codfw, 10DC-Ops: Relabel: puppetserver2005 to puppetserver2001 - https://phabricator.wikimedia.org/T338327 (10Peachey88) [20:15:46] so many jobs [20:15:52] Anyways, JSherman wait another 10 mins or so and tell me if the config change doesn't appear on beta by that time [20:16:01] so little time @Ami [20:16:03] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/928136/41605/" [puppet] - 10https://gerrit.wikimedia.org/r/928136 (https://phabricator.wikimedia.org/T330920) (owner: 10AOkoth) [20:16:06] RoanKattouw: wilco [20:16:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2125', diff saved to https://phabricator.wikimedia.org/P49177 and previous config saved to /var/cache/conftool/dbconfig/20230607-201640-ladsgroup.json [20:18:37] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:927233|Deploy GDI safety survey to JA and RU wikis. (T337728)]] (duration: 10m 53s) [20:18:41] T337728: Deploy Community Safety Survey on RU and JA Wikipedias - estimated week of June 5th 2023 - https://phabricator.wikimedia.org/T337728 [20:19:27] eigyan: Yours is all deployed now [20:19:31] Now moving on to my patches [20:19:43] Going to deploy them together because I don't want to wait for a very slow scap twice [20:19:47] Excellent, thank you RoanKattouw [20:20:25] PROBLEM - php7.4-fpm service on mw1364 is CRITICAL: CRITICAL - Expecting active but unit php7.4-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:20:29] (03PS5) 10Catrope: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064) [20:20:33] (03PS5) 10Catrope: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064) [20:20:39] (03CR) 10Catrope: [C: 03+2] Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [20:21:28] (03Merged) 10jenkins-bot: Change license from CC BY-SA 3.0 to 4.0 on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913018 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [20:21:43] PROBLEM - Check systemd state on mw1364 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [20:22:38] (03Merged) 10jenkins-bot: Link to translations of CC BY-SA 4.0 where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/913019 (https://phabricator.wikimedia.org/T319064) (owner: 10Catrope) [20:23:23] !log catrope@deploy1002 Started scap: Backport for [[gerrit:913019|Link to translations of CC BY-SA 4.0 where possible (T319064)]] [20:23:28] T319064: Creative Commons 4.0 Licensing - https://phabricator.wikimedia.org/T319064 [20:23:39] PROBLEM - PHP7 rendering on mw1364 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:24:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T336886)', diff saved to https://phabricator.wikimedia.org/P49178 and previous config saved to /var/cache/conftool/dbconfig/20230607-202408-ladsgroup.json [20:24:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:24:12] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:24:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:24:55] !log catrope@deploy1002 catrope: Backport for [[gerrit:913019|Link to translations of CC BY-SA 4.0 where possible (T319064)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:27:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [20:27:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance [20:27:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T336886)', diff saved to https://phabricator.wikimedia.org/P49179 and previous config saved to /var/cache/conftool/dbconfig/20230607-202733-ladsgroup.json [20:28:01] RECOVERY - php7.4-fpm service on mw1364 is OK: OK - php7.4-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:28:13] RECOVERY - PHP7 rendering on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:29:19] RECOVERY - Check systemd state on mw1364 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P49180 and previous config saved to /var/cache/conftool/dbconfig/20230607-203146-ladsgroup.json [20:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T336886)', diff saved to https://phabricator.wikimedia.org/P49181 and previous config saved to /var/cache/conftool/dbconfig/20230607-203228-ladsgroup.json [20:32:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:34:08] RoanKattouw: I'm not seeing my stream at https://stream-beta.wikimedia.org/v2/stream/mediawiki.special_diff_interactions or https://stream.wikimedia.org/v2/stream/mediawiki.special_diff_interactions [20:34:08] maybe I'm misunderstanding how to test? 
[20:35:36] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:913019|Link to translations of CC BY-SA 4.0 where possible (T319064)]] (duration: 12m 12s) [20:35:37] Hmm I don't know how that event stream stuff works [20:35:39] T319064: Creative Commons 4.0 Licensing - https://phabricator.wikimedia.org/T319064 [20:36:24] ah, I do see I got some last minute feedback from cming about setting the destination_event_service [20:41:55] (03PS1) 10Jsn.sherman: followup beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928139 [20:42:04] JSherman: ya - i just happened to see your patch -- i'm not sure if you have to specify destinationeventservice explicitly - i believe there is a default [20:43:20] cjming: yeah, I can see all the other beta streams specify it, and the default is not something I can get to [20:43:39] as for testing in the beta ui, there might be a latency [20:45:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bullseye [20:45:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors: - db... [20:46:21] JSherman: https://stream-beta.wmflabs.org/v2/ui/#/?streams=mediawiki.special_diff_interactions << and on the test server, trigger the event and hopefully you'll see it? 
[20:46:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T336886)', diff saved to https://phabricator.wikimedia.org/P49182 and previous config saved to /var/cache/conftool/dbconfig/20230607-204652-ladsgroup.json [20:46:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [20:46:56] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:47:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [20:47:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:47:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:47:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T336886)', diff saved to https://phabricator.wikimedia.org/P49183 and previous config saved to /var/cache/conftool/dbconfig/20230607-204728-ladsgroup.json [20:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P49184 and previous config saved to /var/cache/conftool/dbconfig/20230607-204734-ladsgroup.json [20:49:10] cjming: I was trying to trigger it on en beta. do you think that's wrong? 
[20:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T336886)', diff saved to https://phabricator.wikimedia.org/P49185 and previous config saved to /var/cache/conftool/dbconfig/20230607-204951-ladsgroup.json [20:50:26] JSherman: oh right, it's on beta -- idk if it will register at that eventstreams ui if the stream config is not technically merged - i can try to ask around [20:50:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [20:50:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [20:50:51] and/or if there's a latency (there is for prod) [20:51:19] I went ahead and made a followup patch here with that destination service set: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/928139 [20:55:48] yeah, I'm able to look up mediawiki.ipinfo_interaction which has the destination set and only seems to exist in labs. So I think I need to have that key here. [20:57:05] RoanKattouw: any chance we can do this followup today https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/928139 ? [20:57:05] or should I reschedule for another window, or revert and do the whole thing over? 
[21:02:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P49186 and previous config saved to /var/cache/conftool/dbconfig/20230607-210240-ladsgroup.json [21:03:23] RoanKattouw: JSherman: i can scap Jason's patch real quick unless Roan you're already on it [21:03:36] Sorry I'm back now [21:03:49] (03CR) 10Catrope: [C: 03+2] followup beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928139 (owner: 10Jsn.sherman) [21:04:03] thanks! [21:04:18] I've +2ed the patch and I'll pull it onto the deployment server when it's merged. It doesn't need manual deployment from there, the beta cluster jobs will pick it up [21:04:37] (03Merged) 10jenkins-bot: followup beta: log additional click events on Special:Diff|MobileDiff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928139 (owner: 10Jsn.sherman) [21:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P49187 and previous config saved to /var/cache/conftool/dbconfig/20230607-210457-ladsgroup.json [21:17:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T336886)', diff saved to https://phabricator.wikimedia.org/P49188 and previous config saved to /var/cache/conftool/dbconfig/20230607-211746-ladsgroup.json [21:17:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [21:17:51] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:18:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1199.eqiad.wmnet with reason: Maintenance [21:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T336886)', diff saved to https://phabricator.wikimedia.org/P49189 and 
previous config saved to /var/cache/conftool/dbconfig/20230607-211807-ladsgroup.json [21:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P49190 and previous config saved to /var/cache/conftool/dbconfig/20230607-212003-ladsgroup.json [21:23:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T336886)', diff saved to https://phabricator.wikimedia.org/P49191 and previous config saved to /var/cache/conftool/dbconfig/20230607-212303-ladsgroup.json [21:23:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:30:22] (03CR) 10Cwhite: [C: 03+1] "This duplicates what is happening in statsd_proxy and it may be a benefit to make this 6to4 relay feature into a reusable define... someda" [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [21:32:35] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs1016.eqiad.wmnet [21:32:35] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1016.eqiad.wmnet [21:32:40] cjming: RoanKattouw: I'm still getting a stream not found / 400 error on that stream, so I'll go back and ask phuedx for some troubleshooting help and come back during another window. 
[21:32:50] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [21:33:04] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye [21:33:59] JSherman: Sam is on sabbatical til July -- I'll do my best to help you in the meantime [21:34:40] (03PS1) 10Dreamy Jazz: Use cuc_timestamp as index field when reading old [extensions/CheckUser] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/928147 (https://phabricator.wikimedia.org/T338287) [21:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T336886)', diff saved to https://phabricator.wikimedia.org/P49192 and previous config saved to /var/cache/conftool/dbconfig/20230607-213509-ladsgroup.json [21:35:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [21:35:16] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:35:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [21:35:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49193 and previous config saved to /var/cache/conftool/dbconfig/20230607-213530-ladsgroup.json [21:35:56] Dreamy_Jazz: if you want, we can deploy that ^ [21:36:05] Okay. 
[21:36:23] (03CR) 10Zabe: [C: 03+2] Use cuc_timestamp as index field when reading old [extensions/CheckUser] (wmf/1.41.0-wmf.11) - 10https://gerrit.wikimedia.org/r/928147 (https://phabricator.wikimedia.org/T338287) (owner: 10Dreamy Jazz) [21:36:39] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs2012.codfw.wmnet [21:36:39] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs2012.codfw.wmnet [21:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P49194 and previous config saved to /var/cache/conftool/dbconfig/20230607-213809-ladsgroup.json [21:38:50] JSherman: i asked about it in the Event Platform channel and Andrew will look into it - he thinks something might be stuck on beta [21:39:25] but in theory i think it is supposed to show up shortly - but same, i don't see it yet either :( [21:40:33] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:40:47] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:40:51] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-categories.service,wdqs-updater.service,wmf_auto_restart_prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:57] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook 
[21:41:01] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:41:37] ^ These can be safely ignored (related to testing of new cookbook) [21:42:44] cjming: I appreciate the followup; thanks! [21:43:01] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:43:07] PROBLEM - Query Service HTTP Port on wdqs2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:43:21] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49195 and previous config saved to /var/cache/conftool/dbconfig/20230607-214325-ladsgroup.json [21:43:31] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:47:57] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:48:15] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2012 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:48:29] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), 
regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:48:41] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:52:28] (Merged) jenkins-bot: Use cuc_timestamp as index field when reading old [extensions/CheckUser] (wmf/1.41.0-wmf.11) - https://gerrit.wikimedia.org/r/928147 (https://phabricator.wikimedia.org/T338287) (owner: Dreamy Jazz) [21:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P49196 and previous config saved to /var/cache/conftool/dbconfig/20230607-215315-ladsgroup.json [21:53:23] !log zabe@deploy1002 Started scap: Backport for [[gerrit:928147|Use cuc_timestamp as index field when reading old (T338287)]] [21:53:26] T338287: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'timestamp' in 'where clause - https://phabricator.wikimedia.org/T338287 [21:55:01] !log zabe@deploy1002 dreamyjazz and zabe: Backport for [[gerrit:928147|Use cuc_timestamp as index field when reading old (T338287)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:55:03] Dreamy_Jazz: can you test? :) [21:55:19] Sure. Will do now. 
[21:55:51] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:56:05] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:56:17] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:57:05] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:58:16] ^^ ryankemper looks like the cookbook isn't downtiming properly for some reason [21:58:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P49197 and previous config saved to /var/cache/conftool/dbconfig/20230607-215831-ladsgroup.json [21:58:51] Test complete [21:59:30] zabe: [21:59:40] thanks [22:00:14] If you need to see the account name I checked, then I can say [22:00:24] But it's my most recent check on enwiki [22:01:14] No exception seen and I could page the results [22:01:15] i don't need it, I just double checked the logs whether there are some errors (there are not) [22:02:03] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2012 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:02:15] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with 
UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:02:29] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:02:56] I did accidentally trigger the exception again as I reloaded the page without having the debug mode on, so ignore exceptions that just appeared. [22:03:17] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:04:01] For some reason I'm getting the message "x-wikimedia-debug-routing: no match found for the backend specified in X-Wikimedia-Debug" when trying to use debug mode [22:04:52] Fixed that problem [22:05:09] For some reason the firefox extension had unselected a test server option [22:05:11] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:928147|Use cuc_timestamp as index field when reading old (T338287)]] (duration: 11m 48s) [22:05:15] T338287: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'timestamp' in 'where clause - https://phabricator.wikimedia.org/T338287 [22:05:37] should be live [22:05:48] Thanks, will double check. 
[22:07:31] (PS17) BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [22:08:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T336886)', diff saved to https://phabricator.wikimedia.org/P49198 and previous config saved to /var/cache/conftool/dbconfig/20230607-220821-ladsgroup.json [22:08:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [22:08:25] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:08:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1221.eqiad.wmnet with reason: Maintenance [22:08:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:08:39] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2021 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:08:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T336886)', diff saved to https://phabricator.wikimedia.org/P49199 and previous config saved to /var/cache/conftool/dbconfig/20230607-220859-ladsgroup.json [22:09:11] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:10:32] Yep, change successfully deployed. 
[22:11:41] (CR) BCornwall: [V: +1] "Dry run using the new cookbooks_testing stuff on cumin2002:" [cookbooks] - https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: BCornwall) [22:13:09] zabe: Does it need backporting to wmf.12 too? [22:13:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P49200 and previous config saved to /var/cache/conftool/dbconfig/20230607-221338-ladsgroup.json [22:13:56] (PS1) Zabe: Use cuc_timestamp as index field when reading old [extensions/CheckUser] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/928148 (https://phabricator.wikimedia.org/T338287) [22:14:03] (CR) Zabe: [C: +2] Use cuc_timestamp as index field when reading old [extensions/CheckUser] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/928148 (https://phabricator.wikimedia.org/T338287) (owner: Zabe) [22:14:08] Dreamy_Jazz: yeah it does [22:14:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T336886)', diff saved to https://phabricator.wikimedia.org/P49201 and previous config saved to /var/cache/conftool/dbconfig/20230607-221408-ladsgroup.json [22:14:12] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:14:30] Thanks. 
I won't be able to test that, as enwiki doesn't have wmf.12 [22:14:38] Unless I get given rights on testwiki [22:14:48] we can do that [22:14:57] (if you have time) [22:15:18] Sure, though I would also need to find a user with at least 200 edits [22:15:34] who I could check [22:15:52] as otherwise the paging links would not appear and I could not check that the fix has worked [22:16:28] I guess I could check a wide IP range [22:17:01] But would need to find one with enough edits before running the check [22:17:20] The alternative is that the config is lowered so that I can make paging occur for a user with few edits [22:17:35] That config being CheckUserMaximumRowCount [22:19:13] to which would it have to be lowered? You can like my account CU'n, I have 40 edits or so. [22:19:20] ah [22:19:21] wait [22:19:32] those are probably far too old [22:19:59] It could be temporarily lowered to something like 20 [22:20:43] Then there would be a paging link for one and also for 2 [22:21:53] Let me just check if I can modify the links client side to get a smaller paging value. [22:22:07] (CR) Andrea Denisse: [C: +1] udp2log: add 6to4 relay [puppet] - https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: Herron) [22:22:31] (CR) Andrea Denisse: [C: +1] opensearch: disable security plugin on codfw [puppet] - https://gerrit.wikimedia.org/r/927771 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite) [22:22:58] I can reduce CheckUserMaximumRowCount on mwdebug [22:23:30] Nope, I cannot modify the limit value for the links. 
So if you could do that, that would make the test possible [22:23:55] The reason is that the limit value is included in the JWT token [22:26:05] (PS1) Zabe: TEMP: Decrease wgCheckUserMaximumRowCount to 5 in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/928170 [22:27:24] (CR) Zabe: [C: +2] TEMP: Decrease wgCheckUserMaximumRowCount to 5 in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/928170 (owner: Zabe) [22:28:14] (Merged) jenkins-bot: TEMP: Decrease wgCheckUserMaximumRowCount to 5 in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/928170 (owner: Zabe) [22:28:35] (PS1) Zabe: Revert "TEMP: Decrease wgCheckUserMaximumRowCount to 5 in testwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/928149 [22:28:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49202 and previous config saved to /var/cache/conftool/dbconfig/20230607-222844-ladsgroup.json [22:28:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [22:28:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:28:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [22:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T336886)', diff saved to https://phabricator.wikimedia.org/P49203 and previous config saved to /var/cache/conftool/dbconfig/20230607-222905-ladsgroup.json [22:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P49204 and previous config saved to /var/cache/conftool/dbconfig/20230607-222914-ladsgroup.json [22:31:43] (Merged) jenkins-bot: Use cuc_timestamp as index field when 
reading old [extensions/CheckUser] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/928148 (https://phabricator.wikimedia.org/T338287) (owner: Zabe) [22:32:55] !log zabe@deploy1002 Started scap: Backport for [[gerrit:928148|Use cuc_timestamp as index field when reading old (T338287)]] [22:32:59] T338287: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'timestamp' in 'where clause - https://phabricator.wikimedia.org/T338287 [22:34:25] (CR) Zabe: [C: +2] Revert "TEMP: Decrease wgCheckUserMaximumRowCount to 5 in testwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/928149 (owner: Zabe) [22:34:31] !log zabe@deploy1002 zabe: Backport for [[gerrit:928148|Use cuc_timestamp as index field when reading old (T338287)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:34:34] Dreamy_Jazz: on debug hosts, added you to cu in testwiki [22:34:45] Testing now. Thanks! [22:34:47] !log zabe@deploy1002 Sync cancelled. [22:35:11] (Merged) jenkins-bot: Revert "TEMP: Decrease wgCheckUserMaximumRowCount to 5 in testwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/928149 (owner: Zabe) [22:36:09] Config doesn't seem to have been modified [22:36:36] Still see paging links of 200, 500, 1000, 2500 and 5000 [22:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T336886)', diff saved to https://phabricator.wikimedia.org/P49205 and previous config saved to /var/cache/conftool/dbconfig/20230607-223644-ladsgroup.json [22:36:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:36:56] Dreamy_Jazz: which debug host do you use? 
[22:37:08] *facepalm* [22:37:12] Didn't use a debug host [22:37:26] okay :p [22:37:27] Apologies [22:37:48] Test complete and successful [22:37:59] !log zabe@deploy1002 Started scap: T338287 [22:38:01] nice [22:38:02] T338287: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'timestamp' in 'where clause - https://phabricator.wikimedia.org/T338287 [22:44:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P49206 and previous config saved to /var/cache/conftool/dbconfig/20230607-224420-ladsgroup.json [22:45:30] !log zabe@deploy1002 Finished scap: T338287 (duration: 07m 30s) [22:45:33] T338287: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'timestamp' in 'where clause - https://phabricator.wikimedia.org/T338287 [22:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P49207 and previous config saved to /var/cache/conftool/dbconfig/20230607-225150-ladsgroup.json [22:59:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T336886)', diff saved to https://phabricator.wikimedia.org/P49208 and previous config saved to /var/cache/conftool/dbconfig/20230607-225926-ladsgroup.json [22:59:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:59:30] (Device rebooted) firing: Alert for device ps1-b1-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [22:59:30] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:59:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [23:02:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
db2099.codfw.wmnet with reason: Maintenance [23:02:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [23:02:36] (PS1) Zabe: Add T336556 to list of tasks for wmgUseGraph [mediawiki-config] - https://gerrit.wikimedia.org/r/928171 [23:03:59] (CR) Zabe: [C: +2] Add T336556 to list of tasks for wmgUseGraph [mediawiki-config] - https://gerrit.wikimedia.org/r/928171 (owner: Zabe) [23:04:30] (Device rebooted) resolved: Device ps1-b1-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:05:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [23:05:24] (Merged) jenkins-bot: Add T336556 to list of tasks for wmgUseGraph [mediawiki-config] - https://gerrit.wikimedia.org/r/928171 (owner: Zabe) [23:05:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [23:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T336886)', diff saved to https://phabricator.wikimedia.org/P49209 and previous config saved to /var/cache/conftool/dbconfig/20230607-230540-ladsgroup.json [23:05:44] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:06:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P49210 and previous config saved to /var/cache/conftool/dbconfig/20230607-230657-ladsgroup.json [23:10:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T336886)', diff saved to https://phabricator.wikimedia.org/P49211 and previous config saved to /var/cache/conftool/dbconfig/20230607-231045-ladsgroup.json [23:10:49] T336886: Add user_is_temp field to the user 
table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:22:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T336886)', diff saved to https://phabricator.wikimedia.org/P49212 and previous config saved to /var/cache/conftool/dbconfig/20230607-232203-ladsgroup.json [23:22:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [23:22:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:22:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [23:22:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49213 and previous config saved to /var/cache/conftool/dbconfig/20230607-232223-ladsgroup.json [23:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P49214 and previous config saved to /var/cache/conftool/dbconfig/20230607-232551-ladsgroup.json [23:30:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49215 and previous config saved to /var/cache/conftool/dbconfig/20230607-233016-ladsgroup.json [23:30:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P49216 and previous config saved to /var/cache/conftool/dbconfig/20230607-234057-ladsgroup.json [23:45:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P49217 and previous 
config saved to /var/cache/conftool/dbconfig/20230607-234522-ladsgroup.json [23:56:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T336886)', diff saved to https://phabricator.wikimedia.org/P49218 and previous config saved to /var/cache/conftool/dbconfig/20230607-235603-ladsgroup.json [23:56:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [23:56:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:56:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [23:56:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T336886)', diff saved to https://phabricator.wikimedia.org/P49219 and previous config saved to /var/cache/conftool/dbconfig/20230607-235624-ladsgroup.json