[00:00:48] 10SRE, 10ops-ulsfo: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (10Dzahn) There are a few things wrong with this automatically created ticket. First of all.. RAID is ok on that host: ` ./check_raid OK: Active: 2, Working: 2, Failed: 0, Spare: 0 OK ` It was just down because appar... [00:02:24] 10SRE, 10ops-ulsfo: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (10Dzahn) This is invalid, SAL confirms this host was reimaged. But maybe we want to fix the side issues? [00:02:35] 10SRE, 10ops-ulsfo: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (10Dzahn) p:05Triage→03Low [00:02:45] 10SRE: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (10Dzahn) [00:03:42] cjming: Not updated yet: https://ks.wiktionary.org/static/images/project-logos/kswiktionary.png , https://ks.wiktionary.org/static/images/project-logos/kswiki.png and https: //ks.wiktionary.org/static/images/project-logos/kswiki-2x.png [00:03:55] * https://ks.wiktionary.org/static/images/project-logos/kswiki-2x.png [00:04:09] Juan_90264: we're working on doing https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging [00:04:21] we'll ping you when that's done [00:05:08] All right [00:06:08] Juan_90264: we purged all the logos - can you check? [00:07:11] Now the three mentioned been updated [00:07:24] cool. [00:07:37] Thanks guys!
[00:07:48] !log end of UTC late backport & config training window [00:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:56] (03PS1) 10RLazarus: CI: Install mypy type stubs for pyyaml, requests [software/httpbb] - 10https://gerrit.wikimedia.org/r/730945 [00:09:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:16] (03CR) 10RLazarus: [C: 03+2] CI: Install mypy type stubs for pyyaml, requests [software/httpbb] - 10https://gerrit.wikimedia.org/r/730945 (owner: 10RLazarus) [00:12:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:13:50] (03Merged) 10jenkins-bot: CI: Install mypy type stubs for pyyaml, requests [software/httpbb] - 10https://gerrit.wikimedia.org/r/730945 (owner: 10RLazarus) [00:14:53] (03PS1) 10Jforrester: CommonSettings: Drop legacy CentralAuth config flag, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730946 (https://phabricator.wikimedia.org/T277932) [00:24:23] (03PS1) 10RLazarus: Bump requirements to debian buster versions [software/httpbb] - 10https://gerrit.wikimedia.org/r/730950 [00:29:28] PROBLEM - MD RAID on labweb1002 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:29:29] ACKNOWLEDGEMENT - MD RAID on labweb1002 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T293428 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:29:32] 10SRE, 
10ops-eqiad: Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10ops-monitoring-bot) [00:32:18] (03CR) 10RLazarus: [C: 03+2] Bump requirements to debian buster versions [software/httpbb] - 10https://gerrit.wikimedia.org/r/730950 (owner: 10RLazarus) [00:33:26] (03Merged) 10jenkins-bot: Bump requirements to debian buster versions [software/httpbb] - 10https://gerrit.wikimedia.org/r/730950 (owner: 10RLazarus) [00:38:45] (03PS4) 10RLazarus: add httpbb.main to console-scripts entry_points [software/httpbb] - 10https://gerrit.wikimedia.org/r/640256 (owner: 10CDanis) [00:40:51] (03CR) 10RLazarus: [C: 03+2] "Thanks for the patch!" [software/httpbb] - 10https://gerrit.wikimedia.org/r/640256 (owner: 10CDanis) [00:41:55] (03Merged) 10jenkins-bot: add httpbb.main to console-scripts entry_points [software/httpbb] - 10https://gerrit.wikimedia.org/r/640256 (owner: 10CDanis) [00:45:06] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' - https://phabricator.wikimedia.org/T201090 (10AlexisJazz) More failures: T283514 [01:22:50] PROBLEM - Device not healthy -SMART- on labweb1002 is CRITICAL: cluster=misc device=sdb instance=labweb1002 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labweb1002&var-datasource=eqiad+prometheus/ops [01:46:41] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T293053 (10Jacquelinechen) Happy Friday! and thank you. =)) [02:00:48] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:14:38] !log T288231 `wdqs2006` data transfer complete and all tests passing on the host. 
All of `codfw wdqs-internal` is on the new streaming updater [02:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:47] T288231: Deploy the wdqs streaming updater to production - https://phabricator.wikimedia.org/T288231 [02:37:02] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:00] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:02] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:39] (03PS1) 10Effie Mouzeli: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) [05:12:06] (03CR) 10jerkins-bot: [V: 04-1] profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [05:27:35] (03PS2) 10Effie Mouzeli: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) [05:29:03] (03CR) 10jerkins-bot: [V: 04-1] profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [05:48:08] (03PS3) 10Juan90264: Create Rhymes namespace for thwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730738 (https://phabricator.wikimedia.org/T291761) [06:03:18] (03PS1) 10Elukey: charts: update api-gateway's comment about service routes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/730963 [06:15:17] (03PS1) 10Elukey: WIP - hemlfile.d: add the inference service to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/730965 [06:20:08] !log Start server-side upload for 1 video file [06:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:39] (03PS3) 10Effie Mouzeli: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) [06:45:09] (03CR) 10jerkins-bot: [V: 04-1] profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [06:51:09] (03CR) 10Muehlenhoff: [C: 04-1] builder/systemtap: merge role::systemtap::devserver into builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211015T0700) [07:04:07] (03CR) 10Ema: builder/systemtap: merge role::systemtap::devserver into builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730863 (owner: 10Dzahn) [07:14:08] (03PS2) 10Elukey: Update api-gateway chart's comment about service routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/730963 [07:14:10] (03PS2) 10Elukey: hemlfile.d: add the inference service to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/730965 (https://phabricator.wikimedia.org/T288789) [07:14:12] (03PS1) 10Elukey: api-gateway: allow HTTP host header rewrite for discovery endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) [07:15:42] (03PS2) 10Elukey: api-gateway: allow HTTP host header rewrite for discovery endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) [07:15:44] (03PS3) 10Elukey: hemlfile.d: add the inference service to api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/730965 (https://phabricator.wikimedia.org/T288789) [07:16:50] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) [07:37:06] (03PS3) 10Juan90264: Create an alias for the Draft namespace on hrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730744 (https://phabricator.wikimedia.org/T291755) [07:38:08] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/730796 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [07:39:59] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/730796 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [07:41:18] !log ryankemper@cumin1001 
START - Cookbook sre.wdqs.data-transfer [07:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:57:18] (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog2002: apply role::syslog::centralserver [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [07:58:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/730897 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [08:04:08] 10SRE-swift-storage: swift-ring deploys should rsync TARGETS to puppet volatile - https://phabricator.wikimedia.org/T293438 (10fgiunchedi) [08:08:57] (03PS2) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) [08:12:41] (03CR) 10Ayounsi: centrallog2002: apply role::syslog::centralserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [08:13:36] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Peachey88) [08:23:59] (03CR) 10Muehlenhoff: "One comment inline, otherwise looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [08:25:13] (03CR) 10David Caro: "Nice! 
Got some questions and a small request :)" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [08:25:24] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [08:34:27] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/730837 (owner: 10Jbond) [08:36:19] (03CR) 10David Caro: [C: 03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [08:37:45] (03CR) 10David Caro: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/730857 (owner: 10Jbond) [08:40:17] (03PS3) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) [08:42:04] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31712/console" [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [08:51:29] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) Topic mappr for eqiad: ` ./topicmappr rebuild --out-path /home/elukey/T225005/json --force-rebuild --zk-addr conf2004.codfw.wmnet --zk-prefix kafka/main-codfw --brokers -2 --t... [08:52:33] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) I'm going to disable puppet on production GitLab (gitlab1001) soon for around two hours to test https://gerrit.wikimed...
[08:57:40] (03PS4) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) [08:58:27] !log jelto@gitlab1001:~$ sudo disable-puppet "disable puppet on gitlab1001 to test 728380 on GitLab replica - T283076" [08:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:35] T283076: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 [08:59:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31713/console" [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [08:59:25] (03CR) 10David Caro: standard::ntp: move standard ntp to its own profile (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [09:01:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] kubernetes::deployment_server::mediawiki: add logging configuration [puppet] - 10https://gerrit.wikimedia.org/r/730967 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [09:01:47] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) In this case it is interesting to notice the difference between main-codfw and eqiad: ` Broker change summary: New broker 2005 New broker 2004 - Replacing 0, added 2,... 
[09:03:38] (03CR) 10Muehlenhoff: standard::ntp: move standard ntp to its own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [09:04:23] (03PS5) 10Majavah: openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) [09:04:33] (03CR) 10Majavah: "check experiemental" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:04:58] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) Everything committed to https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-eqiad, I'll probably execute the plan on Monday. [09:05:28] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:06:00] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [09:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:07:58] (03PS6) 10Majavah: openstack: haproxy: add tls termination support [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) [09:08:06] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Joe) [09:08:09] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:08:16] 10SRE, 10serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (10Joe) p:05Triage→03Medium [09:10:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/730857 (owner: 10Jbond) [09:11:21] (03CR) 10Majavah: openstack: haproxy: add tls termination support (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:13:07] (03PS1) 10MVernon: codfw-prod: final weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730976 (https://phabricator.wikimedia.org/T290881) [09:13:17] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs2002 [puppet] - 10https://gerrit.wikimedia.org/r/730797 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [09:14:15] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs2002 [puppet] - 10https://gerrit.wikimedia.org/r/730797 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [09:15:21] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [09:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!"
[puppet] - 10https://gerrit.wikimedia.org/r/730837 (owner: 10Jbond) [09:15:58] (03PS1) 10Vgutierrez: Release 0.34 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730977 (https://phabricator.wikimedia.org/T292619) [09:16:37] (03CR) 10Muehlenhoff: [C: 03+1] "Nice :-)" [puppet] - 10https://gerrit.wikimedia.org/r/730856 (owner: 10Jbond) [09:18:59] (03CR) 10Filippo Giunchedi: [C: 03+1] codfw-prod: final weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730976 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [09:21:09] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [09:21:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,gitlab,nginx,redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:49] ^ thats me, is expected [09:21:50] (03CR) 10Vgutierrez: [C: 03+2] Release 0.34 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730977 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:23:56] ack [09:24:49] (03Merged) 10jenkins-bot: Release 0.34 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/730977 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:28:36] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:28:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:31:15] (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: final weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730976 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [09:32:44] (03CR) 10David Caro: [C: 03+1] "😎 nice!" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:35:52] (03CR) 10Filippo Giunchedi: "Hi all, I was wondering what's needed to move forward for this and/or more folks might be interested? thank you" [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [09:38:42] (03CR) 10David Caro: standard::ntp: move standard ntp to its own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [09:41:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. There's a general comment which applies to all clustered services and which requires some more throught/discussion, so it's pr" [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [09:46:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
This is only used in Cloud VPS, so maybe doublecheck beforehand if anyone uses the common role in Horizon (but it seems very u" [puppet] - 10https://gerrit.wikimedia.org/r/730862 (owner: 10Dzahn) [09:46:26] (03CR) 10David Caro: [C: 03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/730856 (owner: 10Jbond) [09:47:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10EChetty) [09:47:47] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/728246 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [09:48:30] (03CR) 10David Caro: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/730879 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [09:48:41] (03PS1) 10Vgutierrez: acme_chief: auto-detect systemd watchdog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730979 (https://phabricator.wikimedia.org/T292619) [09:48:43] (03PS1) 10Vgutierrez: Release 0.34 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730980 (https://phabricator.wikimedia.org/T292619) [09:48:45] (03PS1) 10Vgutierrez: debian: Add release 0.34 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730981 (https://phabricator.wikimedia.org/T292619) [09:49:11] (03PS2) 10Majavah: acme_chief: add wildcard to openstack certs [puppet] - 10https://gerrit.wikimedia.org/r/728246 (https://phabricator.wikimedia.org/T267194) [09:50:59] (03PS4) 10Effie Mouzeli: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) [09:51:18] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) GitLab on the replica looks fine and change of the uid/gid was successful.
I used the following steps: ` sudo /usr/bi... [09:52:27] (03CR) 10jerkins-bot: [V: 04-1] profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [09:52:29] (03PS1) 10Muehlenhoff: Record LDAP access for jacquelinechen [puppet] - 10https://gerrit.wikimedia.org/r/730982 (https://phabricator.wikimedia.org/T293053) [09:53:51] (03CR) 10Muehlenhoff: [C: 03+2] Record LDAP access for jacquelinechen [puppet] - 10https://gerrit.wikimedia.org/r/730982 (https://phabricator.wikimedia.org/T293053) (owner: 10Muehlenhoff) [09:54:13] (03CR) 10Vgutierrez: [C: 03+2] Release 0.34 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730980 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:54:20] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: auto-detect systemd watchdog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730979 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:54:22] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.34 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730981 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:57:56] (03Merged) 10jenkins-bot: acme_chief: auto-detect systemd watchdog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730979 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:58:07] (03Merged) 10jenkins-bot: Release 0.34 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730980 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:58:09] (03Merged) 10jenkins-bot: debian: Add release 0.34 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/730981 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [09:59:40] XioNoX topranks I seemed to remember we did have hostnames in librenms alerts, and found out 
why we don't anymore https://phabricator.wikimedia.org/T273716#7430992 [10:02:11] I'll send the patches for the current version at least, but can't dedicate much more time to it [10:02:41] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10CAS-SSO, 10User-jbond: Thanos and Grafana lose the session after an hour - https://phabricator.wikimedia.org/T268233 (10MoritzMuehlenhoff) Thanks Timo This quarter we're drafting our plans for requirements in terms of 2FA (and the implied tr... [10:04:06] godog: oh, are you saying we need to backport them? [10:08:14] godog: thanks for doing that. I guess we need to work out how to re-integrate these as we upgrade each time in future? [10:09:22] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm) 05Resolved→03Open a:05Joe→03JMeybohm I'll reopen this one as it has more context on the topic of "which API to use for configuration". I've created a simple is... [10:09:27] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10JMeybohm) [10:11:50] XioNoX topranks indeed ("yes" to both your questions) [10:12:19] current status: in a fight with git review [10:13:23] I guess with T278309 we could keep them as quilt patches [10:13:23] T278309: Move librenms deployment to Debian package - https://phabricator.wikimedia.org/T278309 [10:14:41] godog: heh ok. [10:14:44] majavah: that's right yeah, would be one of the advantages [10:14:46] I've been waiting for the next release (which should include the Prometheus exporter changes) and was going to upgrade then.
[10:15:46] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The following units failed: session-214515.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:09] (03PS1) 10Filippo Giunchedi: Use 'title' as Alertmanager summary [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/731007 (https://phabricator.wikimedia.org/T273716) [10:16:11] (03PS1) 10Filippo Giunchedi: Use device's 'alerts' page as Alertmanager 'source' link [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/731008 (https://phabricator.wikimedia.org/T273716) [10:16:13] (03PS1) 10Filippo Giunchedi: Add 'timestamp' annotation to AM alerts [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/731009 (https://phabricator.wikimedia.org/T273716) [10:16:39] there we go, I'm not smart enough to teach 'git review' to DTRT in this case [10:18:05] topranks: yeah that would work too I think, it's been like that since last april anyways [10:18:31] the reviews are out but I'm fine to wait for the new upstream version [10:19:00] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I think we should take another approach instead, left a few comments on https://phabricator.wikimedia.org/T267194#7431108" [puppet] - 10https://gerrit.wikimedia.org/r/728260 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [10:19:04] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I think we should take another approach instead, left a few comments on https://phabricator.wikimedia.org/T267194#7431108" [dns] - 10https://gerrit.wikimedia.org/r/730879 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [10:19:21] Yeah it might make more sense, reduce the overall amount of work anyway. [10:20:10] I'm updating the librenms upgrade docs to mention this pitfall at least [10:20:35] Yeah good call. either way we need to understand how to deal with this. 
[10:23:01] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [10:24:22] {{done}} [10:25:07] alright I'll abandon the above patches then topranks [10:25:28] thanks, sorry for the complication [10:25:45] 21.10 should be out very soon. [10:25:57] sure no worries, not a big deal overall [10:26:38] (03Abandoned) 10Filippo Giunchedi: Use device's 'alerts' page as Alertmanager 'source' link [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/731008 (https://phabricator.wikimedia.org/T273716) (owner: 10Filippo Giunchedi) [10:26:43] (03Abandoned) 10Filippo Giunchedi: Add 'timestamp' annotation to AM alerts [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/731009 (https://phabricator.wikimedia.org/T273716) (owner: 10Filippo Giunchedi) [10:26:48] (03Abandoned) 10Filippo Giunchedi: Use 'title' as Alertmanager summary [software/librenms] (upstream-21.4.0) - 10https://gerrit.wikimedia.org/r/731007 (https://phabricator.wikimedia.org/T273716) (owner: 10Filippo Giunchedi) [10:27:00] some gerrit spam served for friday lunch [10:27:47] not unlike mr Clive https://www.youtube.com/watch?v=1iY9hut8kPY [10:28:25] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10Joe) For the record, we've created a `wmf-certificates` debian package that includes the puppet CA and the internal PKI created by @j... [10:32:14] (03PS1) 10Giuseppe Lavagetto: mwdebug: add logging configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/731011 [10:32:17] lol. 
delicious :) [10:33:10] haha yeah I think I audibly gagged [10:33:26] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: add logging configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/731011 (owner: 10Giuseppe Lavagetto) [10:34:37] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Stage drmrs in Netbox - https://phabricator.wikimedia.org/T283594 (10ayounsi) All infra cables needed for remote hands have been created in Netbox. Most IPs allocated as well. [10:35:37] super-quick way to cook food though... I'll need to give it some consideration :D [10:37:45] (03Merged) 10jenkins-bot: mwdebug: add logging configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/731011 (owner: 10Giuseppe Lavagetto) [10:37:47] heheh he's resourceful alright, IIRC he tried the same with other food too [10:38:28] (03CR) 10Jbond: [C: 03+1] "lgtm, however as per comment i would just drop them" [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:40:04] (03CR) 10Jbond: [C: 03+1] peek: replace crons with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:42:44] (03PS1) 10David Caro: p:environment: Move wmcs specific etc files to p:wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) [10:45:10] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:46:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [10:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:13] (03CR) 10Muehlenhoff: 
peek: replace crons with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:48:31] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [10:49:22] (03CR) 10David Caro: p:environment: Move wmcs specific etc files to p:wmcs::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [10:54:55] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31715/console" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [10:59:10] (03PS2) 10Lucas Werkmeister (WMDE): Set dispatchViaJobsAllowedClients to null everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730747 (https://phabricator.wikimedia.org/T291828) [10:59:12] (03PS2) 10Lucas Werkmeister (WMDE): Remove $wmgWikibaseDispatchViaJobsAllowedClients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730748 (https://phabricator.wikimedia.org/T291828) [10:59:14] (03PS1) 10Lucas Werkmeister (WMDE): Unconditionally enable Wikibase dispatching via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731014 (https://phabricator.wikimedia.org/T292604) [10:59:16] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmg variables for dispatch via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731015 (https://phabricator.wikimedia.org/T292604) [11:00:15] (03PS2) 10Lucas Werkmeister (WMDE): Unconditionally enable Wikibase dispatching via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731014 (https://phabricator.wikimedia.org/T291828) [11:00:17] (03PS2) 10Lucas Werkmeister (WMDE): Remove wmg variables for dispatch via jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731015 
(https://phabricator.wikimedia.org/T291828) [11:01:18] (03PS6) 10Vgutierrez: acme_chief: Support systemd watchdog [puppet] - 10https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619) [11:12:17] (03PS1) 10Jbond: isystemd::sysuser: create option to allow users to login [puppet] - 10https://gerrit.wikimedia.org/r/731017 (https://phabricator.wikimedia.org/T283076) [11:13:54] (03PS1) 10Vgutierrez: acme_chief: Enable watchdog on acmechief-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/731018 (https://phabricator.wikimedia.org/T292619) [11:14:11] (03CR) 10Jelto: [C: 03+1] "lgtm thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/731017 (https://phabricator.wikimedia.org/T283076) (owner: 10Jbond) [11:14:18] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31716/console" [puppet] - 10https://gerrit.wikimedia.org/r/731017 (https://phabricator.wikimedia.org/T283076) (owner: 10Jbond) [11:15:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] isystemd::sysuser: create option to allow users to login [puppet] - 10https://gerrit.wikimedia.org/r/731017 (https://phabricator.wikimedia.org/T283076) (owner: 10Jbond) [11:15:41] (03CR) 10Jbond: "@moritz, i pushed this through as jelto is the only one using it now but please doa post review just incase" [puppet] - 10https://gerrit.wikimedia.org/r/731017 (https://phabricator.wikimedia.org/T283076) (owner: 10Jbond) [11:20:08] (03CR) 10Vgutierrez: "PCC seems happy: https://puppet-compiler.wmflabs.org/compiler1002/31717/" [puppet] - 10https://gerrit.wikimedia.org/r/731018 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [11:20:10] (03PS1) 10Giuseppe Lavagetto: mediawiki: use correct notation to refer to labels 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/731020 [11:20:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: use correct notation to refer to labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/731020 (owner: 10Giuseppe Lavagetto) [11:20:48] (ThanosQueryHttpRequestQueryErrorRateHigh) firing: Thanos Query Frontend is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org [11:24:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2007.codfw.wmnet [11:24:33] (03Merged) 10jenkins-bot: mediawiki: use correct notation to refer to labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/731020 (owner: 10Giuseppe Lavagetto) [11:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:25] (03CR) 10Jbond: [C: 03+1] "LGTM apart from the indentation" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [11:28:35] (03CR) 10Jbond: [C: 03+2] adduser: move login.defs config to adduser [puppet] - 10https://gerrit.wikimedia.org/r/730837 (owner: 10Jbond) [11:28:39] (03CR) 10Jbond: [C: 03+2] P:mail::default_mail_relay: move templates to correct location [puppet] - 10https://gerrit.wikimedia.org/r/730853 (owner: 10Jbond) [11:28:51] (03CR) 10Jbond: [C: 03+2] standrd::ntp: fix ntp order [puppet] - 10https://gerrit.wikimedia.org/r/730857 (owner: 10Jbond) [11:33:39] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:12] (03PS1) 10Lucas Werkmeister (WMDE): mediawiki: Absent wikidatawiki change pruning [puppet] - 10https://gerrit.wikimedia.org/r/731027 (https://phabricator.wikimedia.org/T292604) [11:35:14] (03PS1) 10Lucas Werkmeister (WMDE): mediawiki: Drop absented wikidatawiki change pruning [puppet] - 10https://gerrit.wikimedia.org/r/731028 (https://phabricator.wikimedia.org/T292604) [11:39:20] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix cpu requests unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/731030 [11:40:53] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [11:40:56] (03PS8) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [11:42:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31718/console" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [11:43:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31719/console" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [11:44:18] (03PS4) 10Jbond: standard: remove standard module [puppet] - 10https://gerrit.wikimedia.org/r/730856 [11:45:56] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:47] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs2003 [puppet] - 10https://gerrit.wikimedia.org/r/730798 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [11:48:23] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs2003 [puppet] - 10https://gerrit.wikimedia.org/r/730798 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [11:48:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2007.codfw.wmnet [11:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix cpu requests unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/731030 (owner: 10Giuseppe Lavagetto) [11:49:33] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [11:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:56] (03CR) 10Jbond: [C: 03+1] sre.misc-clusters.thumbor: create batch action cook book for thumbor (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [11:50:04] (03CR) 10Jbond: [C: 03+2] sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [11:50:48] (ThanosQueryHttpRequestQueryErrorRateHigh) resolved: Thanos Query Frontend is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org [11:52:50] (03Merged) 10jenkins-bot: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [11:53:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [11:53:45] (03Merged) 10jenkins-bot: mediawiki: fix cpu requests unit [deployment-charts] - 10https://gerrit.wikimedia.org/r/731030 (owner: 10Giuseppe Lavagetto) [11:55:56] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:57:04] (03PS6) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [11:58:59] (03PS9) 10Jbond: standard::ntp: move standard ntp to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/730852 [11:59:11] (03CR) 10Jbond: standard::ntp: move standard ntp to its own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [12:01:03] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix more units [deployment-charts] - 10https://gerrit.wikimedia.org/r/731038 [12:01:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix more units [deployment-charts] - 10https://gerrit.wikimedia.org/r/731038 (owner: 10Giuseppe Lavagetto) [12:03:31] (03PS7) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) 
[12:05:55] (03Merged) 10jenkins-bot: mediawiki: fix more units [deployment-charts] - 10https://gerrit.wikimedia.org/r/731038 (owner: 10Giuseppe Lavagetto) [12:32:46] (03PS1) 10Muehlenhoff: Add more role contacts [puppet] - 10https://gerrit.wikimedia.org/r/731093 [12:33:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,gitlab,nginx,redis_gitlab,sidekiq,workhorse} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:33:24] ^thats me, expected [12:35:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:41:36] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) I re-enabled puppet on `gitlab1001` and uid/gid change and git user configuration was successful. ` jelto@gitlab1001:... [12:52:10] (03PS1) 10Muehlenhoff: Fix role handling for canaries [puppet] - 10https://gerrit.wikimedia.org/r/731094 [12:58:20] (03PS1) 10Muehlenhoff: Add more role contacts [puppet] - 10https://gerrit.wikimedia.org/r/731095 [13:06:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10lbowmaker) Thanks all. I have generated an ssh key. 
[13:09:23] (03PS2) 10David Caro: p:environment: Move wmcs specific etc files to p:wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) [13:09:25] (03CR) 10David Caro: p:environment: Move wmcs specific etc files to p:wmcs::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:09:28] (03CR) 10David Caro: [V: 03+1] p:environment: Move wmcs specific etc files to p:wmcs::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:11:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:48] (03CR) 10Jbond: [C: 03+1] p:environment: Move wmcs specific etc files to p:wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:13:03] 10SRE, 10Wikimedia-Mailing-lists, 10cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (10Ladsgroup) The comment above has my +1 [13:13:19] (03PS3) 10David Caro: p:environment: Move wmcs specific etc files to p:wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) [13:13:27] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:14:21] !log upload acme-chief 0.34 to apt.wikimedia.org (buster) - T292619 [13:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:30] T292619: Implement a watchdog mechanism on acme-chief - https://phabricator.wikimedia.org/T292619 
[13:15:15] (03CR) 10David Caro: "experimental :facepalm: xd, I'll run manually each to be sure" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:16:25] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31721/console" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:17:05] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Support systemd watchdog [puppet] - 10https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [13:18:26] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Enable watchdog on acmechief-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/731018 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [13:18:48] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31722/console" [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:19:07] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [13:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:22] (03CR) 10David Caro: [V: 03+1 C: 03+2] p:environment: Move wmcs specific etc files to p:wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:20:30] (03PS4) 10David Caro: p:environment: Move wmcs specific etc files to p:wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) [13:20:44] (03CR) 10Ladsgroup: "Duplicate of I785714c00a5" [puppet] - 10https://gerrit.wikimedia.org/r/731027 (https://phabricator.wikimedia.org/T292604) (owner: 10Lucas Werkmeister (WMDE)) [13:21:17] !log updating acme-chief to 
version 0.34 on acmechief-test instances - T292619 [13:21:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:23] T292619: Implement a watchdog mechanism on acme-chief - https://phabricator.wikimedia.org/T292619 [13:22:48] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs2004 [puppet] - 10https://gerrit.wikimedia.org/r/730799 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:23:46] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs2004 [puppet] - 10https://gerrit.wikimedia.org/r/730799 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:24:54] (03CR) 10David Caro: [C: 03+2] p:environment: Move wmcs specific etc files to p:wmcs::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/731012 (https://phabricator.wikimedia.org/T289661) (owner: 10David Caro) [13:24:59] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:18] (03CR) 10Michael Große: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [13:28:16] (03PS1) 10Muehlenhoff: Update tracking data [puppet] - 10https://gerrit.wikimedia.org/r/731100 [13:29:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10CDanis) @lbowmaker Great, thanks. Please paste somewhere in this task the public part of the key (a line of text that will start with something like `ssh-ed25519` or `ssh... 
[13:29:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:30:06] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs1003 [puppet] - 10https://gerrit.wikimedia.org/r/730814 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:30:49] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs1003 [puppet] - 10https://gerrit.wikimedia.org/r/730814 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [13:30:59] !log start topic rebalancing for kafka main-eqiad (long maintenance, it will last a couple of days) [13:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:06] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:32:14] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [13:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:41] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10lbowmaker) ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIDFAMA39ztgFi5ECZb5JUN8BEUNR6UdckIzZBP8gQq9 lbowmaker@wikimedia.org Thanks! 
[13:36:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "I think this makes some sense anyways – the average DispatchChanges job creates more than one EntityChangeNotification job (https://grafan" [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [13:36:48] (03CR) 10Muehlenhoff: [C: 03+2] Update tracking data [puppet] - 10https://gerrit.wikimedia.org/r/731100 (owner: 10Muehlenhoff) [13:39:38] (03CR) 10Jbond: [C: 03+1] Add more role contacts [puppet] - 10https://gerrit.wikimedia.org/r/731095 (owner: 10Muehlenhoff) [13:41:18] (03PS1) 10Vgutierrez: systemd: Allow paging on a systemd::service failure [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) [13:44:11] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.07% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:58:27] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Patch-For-Review, and 3 others: Refactor puppet:base module to reduce unneeded shared code paths - https://phabricator.wikimedia.org/T289661 (10dcaro) [14:01:30] 10SRE, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) 05Open→03Invalid Since we have no mcrouter proxies, and we won't have any scap proxies in the future, closing. [14:06:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Seems fairly simple and reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [14:13:59] (03PS8) 10Arturo Borrero Gonzalez: openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) [14:15:46] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:15:48] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31724/" [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [14:15:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cinder backups: introduce ceph client config [puppet] - 10https://gerrit.wikimedia.org/r/730829 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [14:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:51] (03CR) 10Herron: [C: 03+1] "LGTM, and the related watchdog support looks very nice as well" [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [14:20:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] systemd: Allow paging on a systemd::service failure [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [14:22:26] (03CR) 10Jbond: [C: 03+1] systemd: Allow paging on a systemd::service failure [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [14:23:21] (03PS1) 10Giuseppe Lavagetto: deployment_server: add common version for rsyslog image [puppet] - 10https://gerrit.wikimedia.org/r/731110 [14:24:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server: add common version for rsyslog image [puppet] - 10https://gerrit.wikimedia.org/r/731110 (owner: 10Giuseppe Lavagetto) [14:24:32] (03CR) 10Herron: "thanks for the reviews! 
will deploy this early next week then" [puppet] - 10https://gerrit.wikimedia.org/r/730843 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [14:30:55] (03PS1) 10David Caro: grid_configurator: Added new naming schemes [puppet] - 10https://gerrit.wikimedia.org/r/731111 (https://phabricator.wikimedia.org/T292465) [14:31:19] !run kafka preferred-replica-election on kafka-main1001 to rebalance partition leaders - T288825 [14:31:19] T288825: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 [14:31:36] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] grid_configurator: Added new naming schemes [puppet] - 10https://gerrit.wikimedia.org/r/731111 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [14:37:20] (03PS1) 10David Caro: wmcs-srpeadcheck-tools: add new shorter webgrid names [puppet] - 10https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) [14:37:22] (03PS1) 10David Caro: tools-clush-generator: add the shorter webgrid names [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) [14:37:57] (03CR) 10David Caro: [C: 03+2] grid_configurator: Added new naming schemes [puppet] - 10https://gerrit.wikimedia.org/r/731111 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [14:41:45] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [14:42:07] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10jijiki) 05Open→03Declined We are not using proxies anymore, but some TKOs we see every now and then could
be related to T291385, not much we can d... [14:48:08] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:39] (03CR) 10Accraze: [C: 03+1] Update api-gateway chart's comment about service routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/730963 (owner: 10Elukey) [14:54:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) @Ottomata or @odimitrijevic please approve, thanks! [14:54:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) also @DAbad please confirm you approve as well [15:02:34] (03PS1) 10Btullis: Remove alluxio from the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) [15:04:57] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31725/console" [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:08:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:18:31] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs2007 [puppet] - 10https://gerrit.wikimedia.org/r/730800 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:18:37] (03PS2) 10Gehel: wdqs: enable the streaming updater on wdqs1008 [puppet] - 10https://gerrit.wikimedia.org/r/730815 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:21:01] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs2007 [puppet] - 10https://gerrit.wikimedia.org/r/730800 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:21:16] (03PS3) 10Gehel: wdqs: enable the streaming updater on wdqs1008 [puppet] - 10https://gerrit.wikimedia.org/r/730815 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:22:22] (03CR) 10Gehel: [C: 03+2] wdqs: enable the streaming updater on wdqs1008 [puppet] - 10https://gerrit.wikimedia.org/r/730815 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [15:22:45] (03CR) 10Ema: [C: 03+1] "One very minor nit, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [15:23:42] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:23:45] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:14] (03PS1) 10CDanis: shell access & analytics w/ krb for lbowmaker [puppet] - 10https://gerrit.wikimedia.org/r/731120 (https://phabricator.wikimedia.org/T293241) [15:25:05] (03CR) 10CDanis: [C: 03+2] shell access & analytics w/ krb for lbowmaker [puppet] - 10https://gerrit.wikimedia.org/r/731120 (https://phabricator.wikimedia.org/T293241) (owner: 10CDanis) [15:25:26] 
(03PS2) 10Btullis: Remove alluxio resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) [15:27:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:28:05] (03CR) 10RLazarus: "Joe: Adding two questions I meant to ask, about how this ought to work." [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/723663 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [15:28:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10CDanis) 05Open→03Resolved a:03CDanis Access granted to the wmf group in LDAP, so you should be able to access web-based tools. Shell account c... [15:29:52] (03CR) 10Elukey: [C: 03+1] "😞" [puppet] - 10https://gerrit.wikimedia.org/r/731115 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [15:32:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10CDanis) 05Open→03Stalled p:05Triage→03Medium [15:45:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/731101 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [15:51:45] (03PS1) 10Giuseppe Lavagetto: mediawiki: actually load the kafka module in rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/731123 [15:58:42] (03CR) 10Dzahn: [C: 03+2] "wonder why this wasn't merged by jenkins" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 (owner: 10Dzahn) [15:59:06] (03PS1) 10Herron: kafka_shipper: point codfw hosts to kafak-logging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) 
[15:59:44] (03CR) 10jerkins-bot: [V: 04-1] kafka_shipper: point codfw hosts to kafak-logging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [16:00:19] (03PS2) 10Herron: kafka_shipper: point codfw hosts to kafka-logging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) [16:00:49] (03CR) 10jerkins-bot: [V: 04-1] kafka_shipper: point codfw hosts to kafka-logging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) (owner: 10Herron) [16:02:28] (03PS3) 10Herron: kafka_shipper: point codfw hosts to kafka-logging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/731125 (https://phabricator.wikimedia.org/T293439) [16:02:40] (03Merged) 10jenkins-bot: miscweb: upgrade prod to 2021-10-12-192016-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730938 (owner: 10Dzahn) [16:14:32] (03PS1) 10Hashar: Introduce lint command [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) [16:15:30] (03CR) 10Jcrespo: [C: 03+1] "CCing Manuel and Stevie Beth for awareness, as there may be other script to change (although less urgent, as probably there won't be any d" [puppet] - 10https://gerrit.wikimedia.org/r/730793 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:17:37] (03CR) 10jerkins-bot: [V: 04-1] Introduce lint command [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) (owner: 10Hashar) [16:22:08] jouncebot: nowandnext [16:22:08] For the next 14 hour(s) and 37 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211015T0700) [16:22:08] In 14 hour(s) and 37 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211016T0700) [16:22:16] oh, it's friday [16:24:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for echetty - https://phabricator.wikimedia.org/T293455 (10Ottomata) approved [16:29:16] 10SRE, 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) Topics moved today and their timings (main-eqiad): ` Oct 15 16:25 codfw.mediawiki.job.cdnPurge.json Oct 15 16:21 eqiad.mediawiki.job.cdnPurge.json Oct 15 16:15 codfw.mediawiki... [16:44:31] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [16:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:58] !log gitlab2001 - temp stopped puppet - debugging gitlab restore script with Arnold [16:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:46] (03PS1) 10Jbond: sre: add contool aware SREBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/731153 [16:53:04] 10SRE, 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q2): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) [17:01:21] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:18] (03PS1) 10AOkoth: gitlab: redirect out to logfile in restore script [puppet] - 10https://gerrit.wikimedia.org/r/731154 (https://phabricator.wikimedia.org/T285867) [17:05:17] !log gitlab2001 - temp stopped puppet - debugging gitlab restore script with Arnold - T283076 [17:05:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:23] T283076: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 [17:06:24] (03CR) 10Dzahn: [C: 03+2] gitlab: redirect out to logfile in restore script [puppet] - 10https://gerrit.wikimedia.org/r/731154 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [17:11:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:12:56] (03PS1) 10Dzahn: gitlab: re-enable timer for backup-restore script [puppet] - 10https://gerrit.wikimedia.org/r/731155 (https://phabricator.wikimedia.org/T285867) [17:13:32] (03CR) 10AOkoth: [C: 03+1] gitlab: re-enable timer for backup-restore script [puppet] - 10https://gerrit.wikimedia.org/r/731155 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [17:13:51] (03CR) 10Dzahn: [C: 03+2] gitlab: re-enable timer for backup-restore script [puppet] - 10https://gerrit.wikimedia.org/r/731155 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [17:14:02] (03PS2) 10Dzahn: gitlab: re-enable timer for backup-restore script [puppet] - 10https://gerrit.wikimedia.org/r/731155 (https://phabricator.wikimedia.org/T285867) [17:17:31] !log gitlab1001 - disabling puppet for debugging [17:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:57] (03PS2) 10Jbond: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 [17:20:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:45] (03CR) 10Jbond: "thanks updatedd" [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [17:21:59] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre: update 
run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [17:22:55] (03PS3) 10Jbond: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 [17:22:57] (03CR) 10Jbond: cookbooks sre: update run_scripts to accept a list of scripts not functions (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [17:23:40] (03CR) 10Accraze: hemlfile.d: add the inference service to api-gateway (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/730965 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [17:23:59] (03PS4) 10Jbond: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 [17:24:39] (03CR) 10Accraze: [C: 03+1] "Nice one Luca!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730966 (https://phabricator.wikimedia.org/T288789) (owner: 10Elukey) [17:26:35] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [17:26:45] (03PS3) 10Jbond: cookbook sre: update SREBatchBase/SREBatchRunnerBase with minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/730506 [17:26:57] (03CR) 10Jbond: cookbook sre: update SREBatchBase/SREBatchRunnerBase with minor fixes (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/730506 (owner: 10Jbond) [17:27:20] (03PS5) 10Jbond: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 [17:27:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:29:46] 
ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,redis_gitlab,sidekiq,workhorse} site=codfw daniel_zahn debugging going on https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:30:05] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [17:38:59] (03PS2) 10Jbond: mediawiki: add get_primary_dc function [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 [17:39:42] (03CR) 10Jbond: "thanks updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [17:40:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:44:50] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: add get_primary_dc function [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [17:46:44] (03PS1) 10Dzahn: Revert "gitlab: re-enable timer for backup-restore script" [puppet] - 10https://gerrit.wikimedia.org/r/731127 [17:47:21] (03CR) 10AOkoth: [C: 03+1] Revert "gitlab: re-enable timer for backup-restore script" [puppet] - 10https://gerrit.wikimedia.org/r/731127 (owner: 10Dzahn) [17:47:42] (03CR) 10Dzahn: [C: 03+2] Revert "gitlab: re-enable timer for backup-restore script" [puppet] - 10https://gerrit.wikimedia.org/r/731127 (owner: 10Dzahn) [18:17:36] What happened to "$wgExemptFromUserRobotsControl" in InitialiseSettings.php? I can't find it, https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php [18:19:23] Documentation: https://www.mediawiki.org/wiki/Manual:$wgExemptFromUserRobotsControl [18:23:48] Hello? 
[18:34:34] see https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/2fb1ec93a4fdca25f027ff69bddabe24caeaceba/wmf-config/CommonSettings.php#3982 and https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/2fb1ec93a4fdca25f027ff69bddabe24caeaceba/wmf-config/InitialiseSettings.php#14077 [18:37:12] legoktm: So it's now being represented by "$wgContentNamespaces" and "$wmgExemptFromUserRobotsControlExtra"? These I managed to find [18:37:24] on some wikis, yes [18:39:16] Thanks legoktm, now I will follow with a task adding in "$wgContentNamespaces" [18:44:25] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:46] PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 5060 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:06:30] (03PS1) 10CDanis: Filter prom-exported NEL stats to <=10min old reports [puppet] - 10https://gerrit.wikimedia.org/r/731166 (https://phabricator.wikimedia.org/T257527) [19:09:50] RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 1156 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:12:12] (03CR) 10CDanis: "BTW I tried this at https://logstash.wikimedia.org/app/dev_tools#/console and I think it works" [puppet] - 10https://gerrit.wikimedia.org/r/731166 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [19:30:07] (03CR) 10Cwhite: [V: 03+1 C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/731166 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [19:38:41] (03CR) 10CDanis: [C: 03+2] Filter prom-exported NEL stats to <=10min old reports [puppet] - 10https://gerrit.wikimedia.org/r/731166 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [20:03:02] (03PS1) 10CDanis: Add rate of high-signal NELs as a status page metric [puppet] - 10https://gerrit.wikimedia.org/r/731171 (https://phabricator.wikimedia.org/T285569) [20:07:07] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: enable the streaming updater on wdqs1011 [puppet] - 10https://gerrit.wikimedia.org/r/730816 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [20:09:13] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:11] (03PS2) 10CDanis: Add rate of high-signal NELs as a status page metric [puppet] - 10https://gerrit.wikimedia.org/r/731171 (https://phabricator.wikimedia.org/T285569) [20:36:06] (03CR) 10Dzahn: [C: 03+1] osm: convert common role to profile, avoid role inside role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730862 (owner: 10Dzahn) [20:36:54] (03CR) 10Dzahn: [C: 03+2] osm: convert common role to profile, avoid role inside role [puppet] - 10https://gerrit.wikimedia.org/r/730862 (owner: 10Dzahn) [20:38:41] (03CR) 10Cwhite: [C: 03+2] logstash: deploy ecs patch 5 [puppet] - 10https://gerrit.wikimedia.org/r/730588 (owner: 10Cwhite) [20:39:22] (03CR) 10Dzahn: "compiler failure in deployment-prep is unrelated: Function lookup() did not find a value for the name 'profile::maps::cassandra::kartother" [puppet] - 10https://gerrit.wikimedia.org/r/730862 (owner: 10Dzahn) [20:39:37] mutante: ready to merge? [20:39:52] cwhite: yes! [20:40:10] done :) [20:40:14] you got the lock file, go ahead, thanks! 
[20:42:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_streaming_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:43:15] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:43:25] (03CR) 10Dzahn: peek: replace crons with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:43:52] (03CR) 10Dzahn: peek: replace crons with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:44:03] (03PS3) 10Dzahn: peek: replace crons with timers [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) [20:48:36] (03CR) 10Dzahn: "asking Scott" [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:59:37] (03CR) 10SBassett: peek: replace crons with timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:59:44] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen) Looks good from my end - looks like there's some ongoing work with restore scripts, but feel free to resolve once th... [21:04:33] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Dzahn) That's right. 
The restore script works when started manually but it does not work, and unfortunately breaks things, wh... [21:09:04] (03CR) 10Dzahn: "Thanks Scott!, @Moritz / John ^ easiest would be to just merge, but I can amend :)" [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:19:12] (03CR) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [21:20:38] (03CR) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [21:21:12] (03Abandoned) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [21:21:47] (03PS1) 10BryanDavis: toolhub: Add URLLIB3_DISABLE_WARNINGS envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/731181 (https://phabricator.wikimedia.org/T292025) [21:28:05] (03PS1) 10Dzahn: simplelamp2: declare httpd::mpm explicitly and use prefork MPM [puppet] - 10https://gerrit.wikimedia.org/r/731183 [21:28:07] (03PS1) 10Dzahn: simplelap: declare httpd::mpm explicitly and use prefork MPM [puppet] - 10https://gerrit.wikimedia.org/r/731184 [21:30:55] (03PS4) 10Dzahn: peek: drop cron class [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) [21:31:50] (03CR) 10Dzahn: [C: 03+2] "thanks all, merging, dropped cron part with a comment, kept the main class." [puppet] - 10https://gerrit.wikimedia.org/r/730867 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:33:37] dpifke: should we try to restart this? 
The following units failed: excimer-wall-log.service on webperf1002 and 2002 [21:34:01] PROBLEM - Disk space on aqs1012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-b 114032 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1012&var-datasource=eqiad+prometheus/ops [21:34:12] Will take a look shortly. [21:34:28] ACK, thanks [21:36:32] !log dzahn@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [21:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:44:04] !log dzahn@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:26] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: enable the streaming updater on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/730817 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [21:51:22] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:59] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve slow query handling - https://phabricator.wikimedia.org/T293530 (10Legoktm) [21:55:09] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (10Legoktm) [21:55:34] 
(03PS1) 10Dzahn: miscweb: bump staging and prod to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/731186 [21:56:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:56:13] (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump staging and prod to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/731186 (owner: 10Dzahn) [21:56:47] (03PS2) 10Dzahn: miscweb: bump staging and prod to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/731186 [21:57:03] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Legoktm) [22:01:47] 10SRE, 10DBA, 10Sustainability (Incident Followup): Lower automatic query killing threshold to 55 seconds - https://phabricator.wikimedia.org/T293533 (10Legoktm) [22:04:49] 10SRE, 10DBA, 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Legoktm) [22:05:22] !log dpifke@deploy1002 Started deploy [performance/arc-lamp@40cb764]: Revert problematic arclamp patch to fix daemon crashes [22:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:28] !log dpifke@deploy1002 Finished deploy [performance/arc-lamp@40cb764]: Revert problematic arclamp patch to fix daemon crashes (duration: 00m 05s) [22:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:46] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging and prod to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/731186 (owner: 10Dzahn) [22:07:47] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:27] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:40] :) [22:09:15] I have no idea why that suddenly became problematic. I fixed it for now by reverting the patch that introduced that code (which has been running fine for months!). Will dig deeper on Monday. [22:10:01] ACK, thanks for the update [22:10:27] Thanks for passing on the alert. I probably wouldn't have noticed otherwise. [22:10:32] yw [22:11:02] just checking icinga web UI on a Friday [22:11:13] (03Merged) 10jenkins-bot: miscweb: bump staging and prod to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/731186 (owner: 10Dzahn) [22:14:07] 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve slow query handling - https://phabricator.wikimedia.org/T293530 (10Legoktm) The one subtask I haven't filed yet because I haven't had the chance to verify it is having excimer be able to interrupt C functions like `mysqli_*`. My theory is that if MW... [22:14:31] !log dzahn@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [22:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:59] ACKNOWLEDGEMENT - Device not healthy -SMART- on labweb1002 is CRITICAL: cluster=misc device=sdb instance=labweb1002 job=node site=eqiad daniel_zahn https://phabricator.wikimedia.org/T293428 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labweb1002&var-datasource=eqiad+prometheus/ops [22:18:47] !log dzahn@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[22:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:17] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Dzahn) @Andrew One disk out of 2 in one out of 2 backends of labweb.svc failed. Fairly urgent? Medium priority? Should we ping @Jclark-ctr directly about coordinating this with your... [22:33:00] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Jclark-ctr) @Andrew @Dzahn Server is out of warranty. Are you able to tell me what size hard drive it is? I can tell you if i have a spare. Is this urgent? i am not on site at... [22:34:56] !log apt2001 - upgraded nginx [22:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:57] !log apt2001 - removing nginx package, accidentally installed, should just be nginx-light of course, running puppet [22:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:14] !log apt1001 - removing nginx package, accidentally installed, should just be nginx-light of course, running puppet [22:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:55] (03PS1) 10Urbanecm: growthexperiments/updatementeedata: Run updates every three hours [puppet] - 10https://gerrit.wikimedia.org/r/731192 (https://phabricator.wikimedia.org/T293447) [22:43:22] (03CR) 10Urbanecm: [C: 04-1] "DNM, yet" [puppet] - 10https://gerrit.wikimedia.org/r/731192 (https://phabricator.wikimedia.org/T293447) (owner: 10Urbanecm) [22:43:32] (03CR) 10Urbanecm: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/731192 (https://phabricator.wikimedia.org/T293447) (owner: 10Urbanecm) [22:46:51] (03CR) 10BryanDavis: [C: 03+2] toolhub: Add URLLIB3_DISABLE_WARNINGS envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/731181 
(https://phabricator.wikimedia.org/T292025) (owner: 10BryanDavis) [22:51:22] (03Merged) 10jenkins-bot: toolhub: Add URLLIB3_DISABLE_WARNINGS envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/731181 (https://phabricator.wikimedia.org/T292025) (owner: 10BryanDavis) [23:00:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:06:30] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Dzahn) p:05Triage→03Medium @Jclark-ctr I would think it's important but not important enough that you go outside of normal work hours. But I am not speaking for wmcs team, just r... [23:10:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:12:04] (03PS1) 10BryanDavis: toolhub: Bump container version & set URLLIB3_DISABLE_WARNINGS=True [deployment-charts] - 10https://gerrit.wikimedia.org/r/731196 (https://phabricator.wikimedia.org/T292025) [23:15:34] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Jclark-ctr) please ping me on irc if this is something that is urgent i can go in if needed [23:19:32] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Bstorm) The bad one is not responding, naturally :) [23:19:43] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Bstorm) The good disk looks like this: ` smartctl 
6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-14-amd64] (local build) Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmo... [23:20:55] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on labweb1002 - https://phabricator.wikimedia.org/T293428 (10Dzahn) @Jclark-ctr It's a 1 TB disk. I confirmed with others it can wait until after the weekend. No need to go in but would be great next week. Thanks for the quick reaction and of... [23:23:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:06] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:26] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: enable the streaming updater on wdqs1005 [puppet] - 10https://gerrit.wikimedia.org/r/730818 (https://phabricator.wikimedia.org/T288231) (owner: 10DCausse) [23:48:23] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log