[00:04:13] (CR) Bstorm: [C: +1] "You've already got the separate alertmanager set up as I recall. Are you ready for merge?" [puppet] - https://gerrit.wikimedia.org/r/705632 (https://phabricator.wikimedia.org/T286335) (owner: Majavah)
[00:56:03] oof, deneb is out of disk space
[01:04:31] SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (Legoktm) p:Triage→Unbreak!
[01:07:07] SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (Legoktm) user homes over 1G: ` 1.4G ema 2.6G jbond 5.2G jmm 5.9G filippo 7.7G razzi 7.9G akosiaris 8.6G elukey 41G otto ` Please see if something can be cleaned up
[01:18:05] SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (Legoktm) 70G in /var/lib/docker...we have some pretty old images that can also be cleaned up from there
[01:20:58] !log legoktm@deneb:~$ docker rmi docker-registry.wikimedia.org/mwcachedir:0.0.1 # T287222
[01:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:09] T287222: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222
[01:21:26] SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (Legoktm) >>! In T287222#7231796, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/Y_3z0HoB1jz_IcWuhNFb} [2021-07-23T01:20:57Z] legoktm@deneb...
[01:56:47] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:41] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (Shizhao)
[02:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[02:14:25] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: labstore1006, ganeti2025, ganeti2026, thanos-be1003, registry2004, registry1003, gitlab2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[02:26:19] !log [WDQS] Pooled `wdqs1004` (all caught up on its mountain of lag)
[02:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:15] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ganeti2025, ganeti2026, gitlab2001, labstore1006, thanos-be1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[02:57:10] (CR) Cwhite: "Looks good to me, but Filippo should have a look." (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: David Caro)
[02:59:37] (CR) Cwhite: "setup.py and exporter.py look ok to me. Filippo should have a quick look at the alertmanager side of things." [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706476 (owner: David Caro)
[03:00:26] (CR) Cwhite: [C: +1] "LGTM" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706477 (owner: David Caro)
[03:02:15] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: gitlab2001, ganeti2025, thanos-be1003, labstore1006, ganeti2026 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[03:02:56] (CR) Cwhite: [C: +1] "LGTM" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706478 (owner: David Caro)
[03:06:35] !log T287223 Installed `nginx-light` on all of `elastic2*` (codfw)
[03:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:45] T287223: Upgrade to nginx-light package on elastic* hosts - https://phabricator.wikimedia.org/T287223
[03:09:26] !log T287223 Installed `nginx-light` on all of `elastic1*` (eqiad)
[03:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:36] !log T287223 Installed `nginx-light` on all of `cloudelastic*`, and it looks like `relforge` didn't need the upgrade. This operation is done.
[03:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:45] T287223: Upgrade to nginx-light package on elastic* hosts - https://phabricator.wikimedia.org/T287223
[03:12:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:14:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:30:10] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[03:30:49] (CR) Cwhite: [C: +1] hieradata: configure thanos rule hosts [puppet] - https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[03:30:55] (CR) Cwhite: [C: +1] role: activate thanos::rule profile on thanos::frontend [puppet] - https://gerrit.wikimedia.org/r/706511 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[03:31:13] (CR) Cwhite: [C: +1] prometheus: pull metrics from thanos rule [puppet] - https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[03:31:26] (CR) Cwhite: [C: +1] thanos: query rule component too [puppet] - https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[03:37:12] (CR) KartikMistry: Add stream configuration for ContentTranslation events (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: KartikMistry)
[03:37:14] (PS8) KartikMistry: Add stream configuration for ContentTranslation events [mediawiki-config] - https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982)
[05:14:16] SRE, ops-codfw, DBA: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (Marostegui) Open→Resolved Closing this as the modules were swapped, we'll see if it happens again, if so, let's reopen. BIOS and firmware were upgraded too
[05:22:28] (CR) Giuseppe Lavagetto: [C: +2] helmfile: allow performing a rolling restart [deployment-charts] - https://gerrit.wikimedia.org/r/706293 (owner: Giuseppe Lavagetto)
[05:25:18] (Merged) jenkins-bot: helmfile: allow performing a rolling restart [deployment-charts] - https://gerrit.wikimedia.org/r/706293 (owner: Giuseppe Lavagetto)
[05:41:30] SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (Joe) So after discussion yesterday, it appears we've come to a consensus that given we're now building incremental...
[05:46:50] (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/706722 (owner: Ssingh)
[05:47:07] (CR) Muehlenhoff: [C: +1] "Looks good :-)" [puppet] - https://gerrit.wikimedia.org/r/706722 (owner: Ssingh)
[06:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[06:36:46] SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (Joe) Things that get served statically include: * Favicons (like https://en.wikipedia.org/static/favicon/wikipedi...
[06:37:38] SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (elukey) Thanks for the ping! Freed stuff on my home dir: ` elukey@deneb:~$ du -hs 136M . `
[06:43:36] (CR) Filippo Giunchedi: pontoon: initialize $_role on bootstrap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705661 (owner: Filippo Giunchedi)
[06:46:45] SRE: open my INC/LLC inn norway - https://phabricator.wikimedia.org/T287229 (Albiyoung)
[06:52:02] SRE: open my INC/LLC inn norway - https://phabricator.wikimedia.org/T287229 (Albiyoung) a:Albiyoung Ozioninc llc holdings INC
[06:52:08] SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (elukey) p:Unbreak!→High Lowering down priority since we have now 22G available.
[06:55:52] (PS1) Marostegui: wmnet: Switchover m1-master from dbproxy1012 to dbproxy1014 [dns] - https://gerrit.wikimedia.org/r/707221 (https://phabricator.wikimedia.org/T286061)
[06:57:00] (PS2) Marostegui: wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012 [dns] - https://gerrit.wikimedia.org/r/707221 (https://phabricator.wikimedia.org/T286061)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210723T0700)
[07:00:21] (CR) Marostegui: [C: +2] wmnet: Switchover m1-master from dbproxy1014 to dbproxy1012 [dns] - https://gerrit.wikimedia.org/r/707221 (https://phabricator.wikimedia.org/T286061) (owner: Marostegui)
[07:02:16] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui) m1-master.eqiad.wmnet switched over to dbproxy1012 which is on row A. Once this row is done, we need to revert that.
[07:02:18] (PS2) Giuseppe Lavagetto: Add configuration for running on kubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418)
[07:02:46] SRE, DBA, Infrastructure-Foundations, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[07:03:20] (CR) jerkins-bot: [V: -1] Add configuration for running on kubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) (owner: Giuseppe Lavagetto)
[07:18:47] RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[07:20:43] (CR) Dzahn: [C: +2] icinga/planet: use letsencrypt check command for https cert monitoring [puppet] - https://gerrit.wikimedia.org/r/706410 (https://phabricator.wikimedia.org/T286713) (owner: Dzahn)
[07:24:41] (CR) Muehlenhoff: "Looks good, a few nits inline." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/706581 (owner: Jbond)
[07:28:06] (PS1) Elukey: profile::prometheus::k8s: collect more metrics from k8s controllers [puppet] - https://gerrit.wikimedia.org/r/707235
[07:30:20] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30303/console" [puppet] - https://gerrit.wikimedia.org/r/707235 (owner: Elukey)
[07:30:32] (CR) Dzahn: [C: +2] typos: add "the the" [puppet] - https://gerrit.wikimedia.org/r/705714 (owner: Dzahn)
[07:32:31] (PS1) Jelto: fix puma exporter listen address [gitlab-ansible] - https://gerrit.wikimedia.org/r/707236 (https://phabricator.wikimedia.org/T275170)
[07:34:15] (CR) Filippo Giunchedi: "Thank you for the review!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/705662 (owner: Filippo Giunchedi)
[07:34:29] (PS2) Filippo Giunchedi: pontoon: initialize $_role on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705661
[07:34:31] (PS2) Filippo Giunchedi: pontoon: initialize user bare repositories on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705662
[07:34:33] (PS2) Filippo Giunchedi: pontoon: stop reading stack from hiera [puppet] - https://gerrit.wikimedia.org/r/705663
[07:34:35] (PS2) Filippo Giunchedi: pontoon: create puppet client dir [puppet] - https://gerrit.wikimedia.org/r/705664
[07:34:37] (PS2) Filippo Giunchedi: pontoon: add instructions [puppet] - https://gerrit.wikimedia.org/r/705665
[07:34:39] (PS2) Filippo Giunchedi: pontoon: run puppet twice at enroll [puppet] - https://gerrit.wikimedia.org/r/705666
[07:34:41] (PS2) Filippo Giunchedi: pontoon: always link hiera directory [puppet] - https://gerrit.wikimedia.org/r/705667
[07:35:12] (PS2) Jelto: fix puma exporter listen address [gitlab-ansible] - https://gerrit.wikimedia.org/r/707236 (https://phabricator.wikimedia.org/T275170)
[07:37:37] (CR) Elukey: [V: +1] "Not sure if this is the best approach, but I am trying to collect metrics for calico to populate dashboards like https://grafana-rw.wikime" [puppet] - https://gerrit.wikimedia.org/r/707235 (owner: Elukey)
[07:38:40] ops-eqiad, DC-Ops: Relabel dbstore1004 to db1183 - https://phabricator.wikimedia.org/T286468 (Marostegui) a:Kormat→wiki_willy
[07:43:54] (PS3) Giuseppe Lavagetto: Add configuration for running on kubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418)
[07:48:31] (CR) Jelto: "In the last change I added additional settings for puma and sidekiq exporters, see I5f5e5b33924b12d27bf5cc81dda5729b199c4553. Metrics for " [gitlab-ansible] - https://gerrit.wikimedia.org/r/707236 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[07:49:19] (CR) Filippo Giunchedi: [C: +2] pontoon: run puppet twice at enroll [puppet] - https://gerrit.wikimedia.org/r/705666 (owner: Filippo Giunchedi)
[07:49:24] (PS3) Filippo Giunchedi: pontoon: run puppet twice at enroll [puppet] - https://gerrit.wikimedia.org/r/705666
[07:52:06] (PS4) Gehel: elastic: pull out execute_on_clusters [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[07:53:54] (CR) Filippo Giunchedi: [C: +1] global: added .gitreview file (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706478 (owner: David Caro)
[07:54:15] (CR) Filippo Giunchedi: [C: +1] global: ran flake8 on the code [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706477 (owner: David Caro)
[07:55:09] (CR) Filippo Giunchedi: [C: +1] "LGTM, the alertmanager_client module is auto-generated but I think that's fine" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706476 (owner: David Caro)
[07:58:18] (CR) Gehel: [C: +1] "LGTM" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/706276 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[08:02:05] (CR) Elukey: [C: +1] "Took a bit of time to parse everything, but docstrings helped a lot. Some extra simple example in the docstring might help further but not" (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/704343 (owner: Volans)
[08:04:34] (CR) Filippo Giunchedi: "LGTM overall, thank you (also for adding tests)!" (5 comments) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: David Caro)
[08:07:55] (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/705019 (https://phabricator.wikimedia.org/T274462) (owner: Cwhite)
[08:12:06] (PS1) Jcrespo: dbbackups: Reimage dbprov1002 to buster [puppet] - https://gerrit.wikimedia.org/r/707243 (https://phabricator.wikimedia.org/T287230)
[08:13:06] (CR) Elukey: [C: +1] "Left some comments for readability, it was a little challenging for me to fully grasp these two code reviews so some extra comments may he" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: Volans)
[08:24:00] !log run 'gnt-instance modify -t plain ml-serve-ctrl1002.eqiad.wmnet' on ganeti1009 as test to track down latency/perf issues with kubelets
[08:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:34] (PS1) Jcrespo: dbbackups: Reorganize backups after dbprov1002 reimage [puppet] - https://gerrit.wikimedia.org/r/707250 (https://phabricator.wikimedia.org/T287230)
[08:28:09] (PS2) David Caro: global: add .gitreview file [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706478
[08:28:14] (CR) Filippo Giunchedi: pontoon: create puppet client dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705664 (owner: Filippo Giunchedi)
[08:31:28] (PS2) Jcrespo: dbbackups: Reorganize backups after dbprov1002 reimage [puppet] - https://gerrit.wikimedia.org/r/707250 (https://phabricator.wikimedia.org/T287230)
[08:39:19] (PS1) Dzahn: site/conftool: add mw1437 through mw1440 as appservers, rack D8 [puppet] - https://gerrit.wikimedia.org/r/707252 (https://phabricator.wikimedia.org/T279309)
[08:40:49] (PS2) Filippo Giunchedi: thanos: add rule to module/profile [puppet] - https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142)
[08:40:51] (PS3) Filippo Giunchedi: hieradata: configure thanos rule hosts [puppet] - https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142)
[08:40:53] (PS3) Filippo Giunchedi: role: activate thanos::rule profile on thanos::frontend [puppet] - https://gerrit.wikimedia.org/r/706511 (https://phabricator.wikimedia.org/T287142)
[08:40:55] (PS3) Filippo Giunchedi: prometheus: pull metrics from thanos rule [puppet] - https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142)
[08:40:57] (PS3) Filippo Giunchedi: thanos: query rule component too [puppet] - https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142)
[08:41:01] (CR) Dzahn: [C: -1] "1437 is already a jobrunner, fix doc" [puppet] - https://gerrit.wikimedia.org/r/707252 (https://phabricator.wikimedia.org/T279309) (owner: Dzahn)
[08:41:06] (CR) Filippo Giunchedi: thanos: query rule component too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[08:41:38] (CR) Filippo Giunchedi: thanos: add rule to module/profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[08:43:05] (PS2) Dzahn: site/conftool: add mw1447 through mw1450 as appservers, rack D8 [puppet] - https://gerrit.wikimedia.org/r/707252 (https://phabricator.wikimedia.org/T279309)
[08:43:15] (CR) Kormat: pontoon: initialize user bare repositories on bootstrap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705662 (owner: Filippo Giunchedi)
[08:43:46] mutante: is there any reason there's 2 sections for D8? Right above there's already a D8 entry
[08:43:48] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8
[08:43:58] (CR) Kormat: [C: +1] pontoon: initialize $_role on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705661 (owner: Filippo Giunchedi)
[08:44:01] (CR) Filippo Giunchedi: "No strong opinion tbh, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/707235 (owner: Elukey)
[08:44:21] (CR) jerkins-bot: [V: -1] thanos: query rule component too [puppet] - https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[08:44:51] RhinosF1: yes, a single one will be converted to different type afterwards
[08:45:06] mutante: ah!
[08:45:21] (CR) Filippo Giunchedi: [C: +2] pontoon: initialize $_role on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705661 (owner: Filippo Giunchedi)
[08:45:58] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[08:46:52] (CR) Elukey: [V: +1] profile::prometheus::k8s: collect more metrics from k8s controllers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/707235 (owner: Elukey)
[08:46:54] (PS2) Elukey: profile::prometheus::k8s: collect more metrics from k8s controllers [puppet] - https://gerrit.wikimedia.org/r/707235
[08:47:31] (PS4) David Caro: am: Add team tags matcher file support [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213)
[08:47:33] (CR) David Caro: am: Add team tags matcher file support (6 comments) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: David Caro)
[08:47:35] (PS1) David Caro: global: add a simple requires.txt [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/707256
[08:47:58] SRE, ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (ayounsi) Resolved→Open Cable to msw1-eqiad is still pointing to the old device, causing an outstanding diff: `lang=diff Changes for 1 devices: ['msw1-eqiad.mgmt.eqiad.wmnet'] [edit interfaces ge-0/0/24] - description...
[08:47:58] (CR) Kormat: [C: +1] pontoon: stop reading stack from hiera [puppet] - https://gerrit.wikimedia.org/r/705663 (owner: Filippo Giunchedi)
[08:48:21] (PS1) Giuseppe Lavagetto: http-fcgi: switch to json logging format [docker-images/production-images] - https://gerrit.wikimedia.org/r/707257 (https://phabricator.wikimedia.org/T285384)
[08:48:24] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Dzahn) we are now about to get to the range mw1448 through mw1450 which are already in rack but not in DNS yet. Could you do these next?
[08:50:44] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30305/console" [puppet] - https://gerrit.wikimedia.org/r/707235 (owner: Elukey)
[08:52:36] (PS3) Dzahn: site/conftool: add mw1439 through mw1442 as appservers, rack D8 [puppet] - https://gerrit.wikimedia.org/r/707252 (https://phabricator.wikimedia.org/T279309)
[08:52:44] (CR) Effie Mouzeli: [C: +1] "<3" [docker-images/production-images] - https://gerrit.wikimedia.org/r/707257 (https://phabricator.wikimedia.org/T285384) (owner: Giuseppe Lavagetto)
[08:53:26] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[08:55:58] (CR) Dzahn: [C: +2] site/conftool: add mw1439 through mw1442 as appservers, rack D8 [puppet] - https://gerrit.wikimedia.org/r/707252 (https://phabricator.wikimedia.org/T279309) (owner: Dzahn)
[08:56:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[1439-1442].eqiad.wmnet with reason: new host
[08:56:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[1439-1442].eqiad.wmnet with reason: new host
[08:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:00] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw1439.eqiad.wmnet
[08:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:43] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw144[0-2].eqiad.wmnet
[08:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:07:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:13] (PS1) Muehlenhoff: Remove access for jkatz [puppet] - https://gerrit.wikimedia.org/r/707273
[09:13:12] (CR) Filippo Giunchedi: [C: +1] "LGTM, can't meaningfully comment on the approach/semantics with respect to k8s though" [puppet] - https://gerrit.wikimedia.org/r/707235 (owner: Elukey)
[09:14:36] (PS1) Majavah: wikireplica_dns: fix s7 web aliases [puppet] - https://gerrit.wikimedia.org/r/707274
[09:14:49] (CR) Filippo Giunchedi: [C: +1] global: add a simple requires.txt [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/707256 (owner: David Caro)
[09:16:14] (CR) Kormat: [C: +1] pontoon: create puppet client dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705664 (owner: Filippo Giunchedi)
[09:16:20] (CR) Muehlenhoff: [C: +2] Remove access for jkatz [puppet] - https://gerrit.wikimedia.org/r/707273 (owner: Muehlenhoff)
[09:20:05] !log hashar@deploy1002 Started deploy [integration/docroot@edae2b4]: doc: add footer link to wikitech documentation
[09:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:16] !log hashar@deploy1002 Finished deploy [integration/docroot@edae2b4]: doc: add footer link to wikitech documentation (duration: 00m 11s)
[09:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:36] (CR) Filippo Giunchedi: "LGTM, see inline" (3 comments) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: David Caro)
[09:28:00] (PS1) Muehlenhoff: Remove LDAP access for Amanda Mooney [puppet] - https://gerrit.wikimedia.org/r/707280
[09:28:26] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 114 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:29:30] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 34 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:29:34] (CR) Muehlenhoff: [C: +2] Remove LDAP access for Amanda Mooney [puppet] - https://gerrit.wikimedia.org/r/707280 (owner: Muehlenhoff)
[09:36:10] (CR) Filippo Giunchedi: pontoon: initialize user bare repositories on bootstrap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705662 (owner: Filippo Giunchedi)
[09:38:11] (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[09:43:54] (CR) Kormat: [C: +1] pontoon: initialize user bare repositories on bootstrap (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705662 (owner: Filippo Giunchedi)
[09:45:15] (CR) Filippo Giunchedi: [C: +2] pontoon: add instructions [puppet] - https://gerrit.wikimedia.org/r/705665 (owner: Filippo Giunchedi)
[09:45:17] (CR) Kormat: [C: +1] pontoon: add instructions [puppet] - https://gerrit.wikimedia.org/r/705665 (owner: Filippo Giunchedi)
[09:45:23] (CR) Filippo Giunchedi: [C: +2] pontoon: always link hiera directory [puppet] - https://gerrit.wikimedia.org/r/705667 (owner: Filippo Giunchedi)
[09:45:32] (CR) Filippo Giunchedi: [C: +2] pontoon: stop reading stack from hiera [puppet] - https://gerrit.wikimedia.org/r/705663 (owner: Filippo Giunchedi)
[09:45:46] (CR) Filippo Giunchedi: [C: +2] pontoon: create puppet client dir [puppet] - https://gerrit.wikimedia.org/r/705664 (owner: Filippo Giunchedi)
[09:45:49] (CR) Filippo Giunchedi: [C: +2] pontoon: initialize user bare repositories on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705662 (owner: Filippo Giunchedi)
[09:46:16] (PS5) David Caro: am: Add team tags matcher file support [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213)
[09:46:18] (PS2) David Caro: global: add a simple requires.txt [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/707256
[09:46:38] (PS3) Filippo Giunchedi: pontoon: initialize user bare repositories on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705662
[09:47:02] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1439.eqiad.wmnet
[09:47:03] a little gerrit spam incoming, sorry about that
[09:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:43] (PS4) Filippo Giunchedi: pontoon: initialize user bare repositories on bootstrap [puppet] - https://gerrit.wikimedia.org/r/705662
[09:47:45] (PS3) Filippo Giunchedi: pontoon: stop reading stack from hiera [puppet] - https://gerrit.wikimedia.org/r/705663
[09:47:47] (PS3) Filippo Giunchedi: pontoon: create puppet client dir [puppet] - https://gerrit.wikimedia.org/r/705664
[09:47:49] (PS3) Filippo Giunchedi: pontoon: add instructions [puppet] - https://gerrit.wikimedia.org/r/705665
[09:47:51] (PS3) Filippo Giunchedi: pontoon: always link hiera directory [puppet] - https://gerrit.wikimedia.org/r/705667
[09:49:16] (CR) David Caro: am: Add team tags matcher file support (3 comments) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: David Caro)
[09:49:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1440.eqiad.wmnet
[09:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:06] (CR) Filippo Giunchedi: [C: +1] "LGTM! Thank you for your help on this. See also below" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: David Caro)
[09:52:58] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[09:57:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1441.eqiad.wmnet
[09:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:57] (PS1) Filippo Giunchedi: pontoon: fix lookup of public_domain in frontend [puppet] - https://gerrit.wikimedia.org/r/707296
[10:01:09] (PS1) Filippo Giunchedi: profile: wait for apache2 in wmcs::instance sites-local [puppet] - https://gerrit.wikimedia.org/r/707297
[10:01:35] (PS1) Dzahn: site/conftool: add mw1443 through mw1446 as API appservers [puppet] - https://gerrit.wikimedia.org/r/707298 (https://phabricator.wikimedia.org/T279309)
[10:01:47] (CR) jerkins-bot: [V: -1] pontoon: fix lookup of public_domain in frontend [puppet] - https://gerrit.wikimedia.org/r/707296 (owner: Filippo Giunchedi)
[10:02:25] (CR) jerkins-bot: [V: -1] site/conftool: add mw1443 through mw1446 as API appservers [puppet] - https://gerrit.wikimedia.org/r/707298 (https://phabricator.wikimedia.org/T279309) (owner: Dzahn)
[10:02:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1442.eqiad.wmnet
[10:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:30] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[10:04:06] jouncebot: now
[10:04:06] For the next 20 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210723T0700) [10:04:22] I’ll test something on mwdebug2001, shouldn’t take more than 10 minutes [10:05:13] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [10:06:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) mw1444 is in DNS but not reachable via SSH (1443, 1445, 1446 are), could you take a look what's special with 1444 please [10:07:39] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:08:43] alright, I’m done testing on mwdebug2001, did a scap pull to wipe my changes [10:08:53] (03PS1) 10Jelto: add mcrouter certs for mw1422.eqiad.wmnet to mw1442.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) [10:10:48] (03PS2) 10Dzahn: site/conftool: add mw1443 through mw1446 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/707298 (https://phabricator.wikimedia.org/T279309) [10:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [10:12:41] (03CR) 10Dzahn: [C: 03+1] "thank you! 
yes, this will make sure we can run puppet compiler on new hosts, though I just learned yesterday that we won't need to create " [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [10:15:30] (03PS2) 10Jelto: add mcrouter certs for mw1422.eqiad.wmnet to mw1442.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) [10:15:43] (03CR) 10Dzahn: [C: 03+2] site/conftool: add mw1443 through mw1446 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/707298 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [10:15:50] (03PS3) 10Dzahn: site/conftool: add mw1443 through mw1446 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/707298 (https://phabricator.wikimedia.org/T279309) [10:17:20] (03CR) 10Dzahn: [C: 03+2] gerrit: config values do not need double quotes [puppet] - 10https://gerrit.wikimedia.org/r/706042 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [10:19:20] (03PS2) 10Filippo Giunchedi: pontoon: fix lookup of public_domain in frontend [puppet] - 10https://gerrit.wikimedia.org/r/707296 [10:21:50] (03CR) 10Dzahn: [C: 03+1] add mcrouter certs for mw1422.eqiad.wmnet to mw1442.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [10:25:45] (03CR) 10Dzahn: [C: 03+1] "actually, can you go up to mw1446, please? I was just about to add those right now." 
[labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [10:37:21] (03PS3) 10Jelto: add mcrouter certs for mw1422.eqiad.wmnet to mw1446.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) [10:38:04] (03CR) 10Dzahn: [C: 03+1] add mcrouter certs for mw1422.eqiad.wmnet to mw1446.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [10:45:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Init thirdparty/kubeadm-k8s-1-19 [puppet] - 10https://gerrit.wikimedia.org/r/705972 (https://phabricator.wikimedia.org/T280340) (owner: 10Majavah) [10:46:00] (03CR) 10Majavah: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/705632 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [10:51:55] PROBLEM - Check systemd state on mw1443 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:55] RECOVERY - Check systemd state on mw1443 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:58] (03PS1) 10Majavah: aptrepo: fix component for helm on kubeadm k8s 1.19 [puppet] - 10https://gerrit.wikimedia.org/r/707311 [10:55:21] PROBLEM - mediawiki-installation DSH group on mw1445 is CRITICAL: Host mw1445 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:55:21] PROBLEM - memcached socket on mw1443 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [10:56:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix component for helm on kubeadm k8s 1.19 [puppet] - 10https://gerrit.wikimedia.org/r/707311 (owner: 10Majavah) [10:57:39] PROBLEM - 
memcached socket on mw1445 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [10:58:38] !log adding packages to buster-wikimedia/thirdparty/kubeadm-k8s-1-19 @ apt1001 [10:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[1443,1445-1446].eqiad.wmnet with reason: new host [11:00:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[1443,1445-1446].eqiad.wmnet with reason: new host [11:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:48] (03PS1) 10Lucas Werkmeister (WMDE): Don’t generate current content text twice [extensions/AbuseFilter] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/707021 [11:03:17] RECOVERY - memcached socket on mw1443 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [11:03:29] RECOVERY - memcached socket on mw1445 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [11:03:30] (03PS2) 10Lucas Werkmeister (WMDE): Don’t generate current content text twice [extensions/AbuseFilter] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/707021 [11:05:54] (03PS3) 10Jbond: Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:07:19] (03CR) 10jerkins-bot: [V: 04-1] Update TLS configuration for analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:09:41] (03PS4) 10Jbond: Update TLS configuration for 
analytics-test-presto [puppet] - 10https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [11:11:05] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw144[3-6].eqiad.wmnet [11:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:30] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1443.eqiad.wmnet [11:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:21] (03PS2) 10Hashar: gerrit: remove SMTP encryption option [puppet] - 10https://gerrit.wikimedia.org/r/706043 (https://phabricator.wikimedia.org/T287122) [11:12:43] (03CR) 10Hashar: "Fixed a typo in commit message pointed by Ahmon and rebased the change." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706043 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:13:09] mutante: guten tag. Thx for the gerrit config change! I have another trivial one https://gerrit.wikimedia.org/r/c/operations/puppet/+/706043 ;) [11:13:27] it drops an unused option which we will never have to use [11:17:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1445.eqiad.wmnet [11:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1446.eqiad.wmnet [11:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:03] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [11:19:24] (03PS3) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [11:20:02] hashar: yep, I already saw that and the comment, will do that [11:20:26] merged the one from 2018 as well but we never got to use it..yea [11:20:37] (03PS4) 10Jbond: 
debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [11:20:53] (03CR) 10Dzahn: [C: 03+2] gerrit: remove SMTP encryption option [puppet] - 10https://gerrit.wikimedia.org/r/706043 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:20:55] (03CR) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [11:22:10] (03CR) 10jerkins-bot: [V: 04-1] debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [11:22:49] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:54] mutante: I think that smtp encryption was to have gerrit speak to mail relay potentially cross DC. Nowadays we can just use "localhost" and magic happens [11:23:09] which as an app maintainer is a blessing: one less thing to manage [11:24:16] (03PS2) 10Hashar: gerrit: listen on all address with iptables rule [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) [11:25:10] the last one I have in the chain, it might be fine. 
Not sure whether it should be pushed on a friday though [11:25:11] that's right, it was h.erron and done as part of https://phabricator.wikimedia.org/T175361 [11:25:37] the mail localhost part [11:26:44] yes, I saw the third one and I'd rather not merge that on Friday and given the history of this week [11:26:56] 10SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (10jbond) cleared mine down to ~650M [11:28:42] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add mcrouter certs for mw1422.eqiad.wmnet to mw1446.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/707300 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [11:32:50] (03CR) 10Hashar: gerrit: listen on all address with iptables rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:33:49] mutante: that 3rd one I believe it is fine, but indeed is a risky one. So indeed I am in favor of doing that next week :] [11:35:04] (03CR) 10Muehlenhoff: debian::autostart: update autostart to use custom policy-rc.d script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [11:38:27] hashar: ACK, so mote it be [11:39:07] will look at that closer after weekend [11:41:12] (03CR) 10Hashar: "The whole idea is to push the restriction at the OS / iptables level instead of having to mess with it in Gerrit. Listening on all inter" [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [11:47:13] (03Abandoned) 10Dzahn: site/conftool: add mw1439, mw1440 as jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/705927 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [11:47:20] (03CR) 10Ssingh: [C: 03+2] auditd: initial commit for the auditd module. 
[puppet] - 10https://gerrit.wikimedia.org/r/705696 (owner: 10Ssingh) [11:49:11] (03PS2) 10Ssingh: wikidough: add motd script to indicate logging of root commands [puppet] - 10https://gerrit.wikimedia.org/r/706722 [11:50:20] !log Change innodb_checksum_algorithm to full_crc32 on pc1011-1014 and pc2011-2014 - T287244 [11:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:29] T287244: Considering switching innodb_checksum_algorithm=full_crc32 - https://phabricator.wikimedia.org/T287244 [11:50:55] (03CR) 10Ssingh: [C: 03+2] wikidough: add motd script to indicate logging of root commands [puppet] - 10https://gerrit.wikimedia.org/r/706722 (owner: 10Ssingh) [11:53:57] (03PS2) 10Ssingh: rsyslog: send auditd/audispd logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/705707 [11:55:19] (03PS5) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [11:56:15] RECOVERY - mediawiki-installation DSH group on mw1445 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:56:47] (03CR) 10jerkins-bot: [V: 04-1] debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [11:58:37] (03PS6) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [11:58:48] (03CR) 10Jbond: "updated thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [12:00:09] (03CR) 10jerkins-bot: [V: 04-1] debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [12:03:01] (03Abandoned) 10Joal: [WIP] Add Gobblin modules [puppet] - 10https://gerrit.wikimedia.org/r/699770 (owner: 10Joal) [12:04:50] (03CR) 10David Caro: [C: 03+1] profile: wait for apache2 in wmcs::instance sites-local [puppet] - 
10https://gerrit.wikimedia.org/r/707297 (owner: 10Filippo Giunchedi) [12:07:47] PROBLEM - Check systemd state on doh2002 is CRITICAL: CRITICAL - degraded: The following units failed: auditd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:41] ^ looking [12:09:53] hashar: out for now (working part time) have a good weekend [12:10:18] (added some new appservers/API servers in eqiad and all is quiet) [12:12:47] should be resolved, it had failed to install the auditd package, running agent fixed it again [12:13:29] RECOVERY - Check systemd state on doh2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:31] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jelto) [12:15:39] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw1439.eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309 [12:15:39] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1439.eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309 [12:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:47] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [12:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:12] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw[1440-1442].eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309 [12:16:13] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw[1440-1442].eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 
https://phabricator.wikimedia.org/T279309 [12:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:36] (03CR) 10David Caro: [C: 03+2] wmcs.puppet_alert: Add more info and differentiate cases [puppet] - 10https://gerrit.wikimedia.org/r/702331 (https://phabricator.wikimedia.org/T285839) (owner: 10David Caro) [12:17:59] (03CR) 10Jelto: [C: 03+2] site/conftool: add mw1439,mw1440,mw1441,mw1442 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [12:18:05] sukhe: what was the error, does auditd maybe need some additional debconf setting? [12:19:06] moritzm: it seems like it was only on doh2002 and now 3002, 2001 and 3001 seem to be ok!? [12:19:09] checking [12:19:11] https://puppetboard.wikimedia.org/report/doh3002.wikimedia.org/0120f983e56dbfd4ff2b997e0af5f658ad155ee6 [12:20:20] https://puppetboard.wikimedia.org/report/doh3001.wikimedia.org/44b41dcaed577ea20f0a37352bd6a34c528e4963 3001 is fine, with the same change [12:20:43] having a look at 3002 [12:21:22] I think it's https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=962451 [12:22:41] PROBLEM - Check systemd state on doh3002 is CRITICAL: CRITICAL - degraded: The following units failed: auditd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:44] yeah, sounds like it [12:22:58] PROBLEM - DPKG on doh3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:23:58] (03PS2) 10Jelto: site/conftool: add mw1439,mw1440,mw1441,mw1442 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) [12:24:02] the bugreport states: "that the timouts also happen randomly at restarts of the service", which would be inconvenient... [12:24:19] yeah sigh. 
off to a great start :) [12:24:29] but let's open a task for now and if there's an ongoing pattern we can backport the listed patch [12:24:33] RECOVERY - Check systemd state on doh3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:45] and if it works fine for us, try to get it into a buster point release [12:25:03] ok thanks, sounds like a plan [12:25:16] (03PS1) 10David Caro: cloud: fix condition on too old puppet run alert [puppet] - 10https://gerrit.wikimedia.org/r/707335 [12:25:18] (03CR) 10jerkins-bot: [V: 04-1] site/conftool: add mw1439,mw1440,mw1441,mw1442 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [12:25:35] (03CR) 10David Caro: [C: 03+2] cloud: fix condition on too old puppet run alert [puppet] - 10https://gerrit.wikimedia.org/r/707335 (owner: 10David Caro) [12:26:19] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: fix lookup of public_domain in frontend [puppet] - 10https://gerrit.wikimedia.org/r/707296 (owner: 10Filippo Giunchedi) [12:28:21] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:39] (03CR) 10David Caro: [C: 03+2] cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [12:29:49] (03CR) 10David Caro: cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [12:45:26] (03CR) 10David Caro: [C: 03+2] wmcs-dns-floating-ip-updater: do a more granular retry [puppet] - 10https://gerrit.wikimedia.org/r/701506 (https://phabricator.wikimedia.org/T285537) (owner: 10David Caro) [12:49:46] 10SRE, 
10SRE-Access-Requests: Issues with server access, assistance requested - https://phabricator.wikimedia.org/T287245 (10RhinosF1) Hi, You should copy the config from https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_SSH_config The correct bastion would now be bast4003. You'll then b... [12:53:49] RECOVERY - DPKG on doh3002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:54:09] (03PS7) 10Jbond: debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 [12:54:57] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jelto) [12:59:31] (03CR) 10Majavah: [C: 03+1] "Thanks! I think this fixes T283531." [puppet] - 10https://gerrit.wikimedia.org/r/707297 (owner: 10Filippo Giunchedi) [13:00:55] (03CR) 10Jbond: [V: 03+1] cloud dev - hiera: add wmflib::expand_path to codfw1dev hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [13:05:14] (03CR) 10Jelto: [C: 03+2] acme_chief: add gitlab2001 to acl for gitlab certificate [puppet] - 10https://gerrit.wikimedia.org/r/706339 (owner: 10Jbond) [13:09:00] (03CR) 10Jbond: [C: 03+1] "lgtm from a python PoV not too failure with this code though" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/706501 (https://phabricator.wikimedia.org/T284213) (owner: 10David Caro) [13:10:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, a few typos inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [13:13:37] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:15] PROBLEM - tilerator on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:14:23] 10SRE, 10SRE-Access-Requests: Issues with server access, assistance requested - https://phabricator.wikimedia.org/T287245 (10Reedy) https://wikitech.wikimedia.org/wiki/Bastion You're definitely using the wrong bastion, 4001 has been gone for a while. Based on your location, you might want to use `bast1003.wik... [13:17:14] 10SRE, 10SRE-Access-Requests: Issues with server access, assistance requested - https://phabricator.wikimedia.org/T287245 (10Aklapper) > I was granted server access in 2018 as karen For the records, that was {T201668}. > 4001 has been gone for a while I'd love decommissioning workflow to set `{{obsolete}}` on... [13:19:33] (03CR) 10Ottomata: Add stream configuration for ContentTranslation events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704456 (https://phabricator.wikimedia.org/T281982) (owner: 10KartikMistry) [13:21:47] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (10Aklapper) 05Open→03Stalled Hi @CMacholan, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Please use the template linked from https://phabricator... [13:26:12] (03PS3) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [13:26:15] 10SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (10Ottomata) Down to 1.5M, thanks! 
[13:31:52] !log otto@deploy1002 Started deploy [analytics/refinery@15521b3]: Add property disabling gobblin lock - T271232 [13:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:00] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [13:34:16] (03PS4) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [13:34:27] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:01] RECOVERY - tilerator on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:35:24] !log otto@deploy1002 Finished deploy [analytics/refinery@15521b3]: Add property disabling gobblin lock - T271232 (duration: 03m 32s) [13:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:06] (03CR) 10Elukey: [V: 03+1 C: 03+2] "Going to merge this to start collecting metrics, if the approach is not ok I'll refactor :)" [puppet] - 10https://gerrit.wikimedia.org/r/707235 (owner: 10Elukey) [13:42:56] (03CR) 10Herron: [C: 03+1] "LGTM as long as its known/accepted that these logs will be viewable by users in the nda and wmf ldap groups" [puppet] - 10https://gerrit.wikimedia.org/r/705707 (owner: 10Ssingh) [13:43:39] (03CR) 10Jbond: [C: 03+2] debian::autostart: update autostart to use custom policy-rc.d script [puppet] - 10https://gerrit.wikimedia.org/r/706581 (owner: 10Jbond) [13:44:12] (03CR) 10Ssingh: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/705707 (owner: 10Ssingh) [13:46:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30313/console" [puppet] - 10https://gerrit.wikimedia.org/r/704333 
(https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [13:49:57] (03PS5) 10Jbond: profile: create in module data for profile [puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) [13:54:56] (03PS10) 10Aklapper: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [13:56:54] (03PS1) 10Jelto: add gitlab2001 to host_vars and variables [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) [14:00:41] (03CR) 10Jelto: "@Brennen could you take a look? I added gitlab2001 and would like to rollout the ansible playbook on gitlab2001." [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [14:06:00] (03CR) 10Herron: [C: 03+1] thanos: add rule to module/profile [puppet] - 10https://gerrit.wikimedia.org/r/706509 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:07:57] 10Puppet, 10SRE, 10Infrastructure-Foundations: Upgrade Puppet to 5.5.21 - https://phabricator.wikimedia.org/T248168 (10Aklapper) Upstream discussion "Puppet 5.5 EOL in November 2020": https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=950182 [14:09:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I like this a lot, but maybe we can just drop the /common/ from the path." 
[puppet] - 10https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [14:09:45] (03CR) 10Herron: [C: 03+1] hieradata: configure thanos rule hosts [puppet] - 10https://gerrit.wikimedia.org/r/706510 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:09:51] (03CR) 10Brennen Bearnes: [C: 04-1] add gitlab2001 to host_vars and variables (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [14:09:59] (03CR) 10Herron: [C: 03+1] role: activate thanos::rule profile on thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/706511 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:10:05] (03CR) 10Herron: [C: 03+1] prometheus: pull metrics from thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/706512 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:10:26] (03CR) 10Herron: [C: 03+1] thanos: query rule component too [puppet] - 10https://gerrit.wikimedia.org/r/706513 (https://phabricator.wikimedia.org/T287142) (owner: 10Filippo Giunchedi) [14:10:33] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [14:11:49] (03PS2) 10Jelto: add gitlab2001 to host_vars and variables [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) [14:12:43] (03PS1) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [14:14:08] (03CR) 10jerkins-bot: [V: 04-1] os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [14:14:11] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] fix puma exporter listen address [gitlab-ansible] - 
10https://gerrit.wikimedia.org/r/707236 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [14:16:30] !log gitlab1001: running ansible to deploy [[gerrit:707236|fix puma exporter listen address]] (T275170) [14:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:39] T275170: Define monitoring for gitlab - https://phabricator.wikimedia.org/T275170 [14:17:48] (03PS1) 10Kormat: admin: Upgrade name. [puppet] - 10https://gerrit.wikimedia.org/r/707377 [14:18:45] (03CR) 10Kormat: [C: 03+2] admin: Upgrade name. [puppet] - 10https://gerrit.wikimedia.org/r/707377 (owner: 10Kormat) [14:19:14] (03CR) 10Jbond: [V: 03+1] "FYI i have avoided merging this as im also considering a different approach" [puppet] - 10https://gerrit.wikimedia.org/r/702325 (https://phabricator.wikimedia.org/T285539) (owner: 10Jbond) [14:19:55] (03PS2) 10Filippo Giunchedi: profile: wait for apache2 in wmcs::instance sites-local [puppet] - 10https://gerrit.wikimedia.org/r/707297 (https://phabricator.wikimedia.org/T283531) [14:20:06] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/707297 (https://phabricator.wikimedia.org/T283531) (owner: 10Filippo Giunchedi) [14:28:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] http-fcgi: switch to json logging format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/707257 (https://phabricator.wikimedia.org/T285384) (owner: 10Giuseppe Lavagetto) [14:28:13] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] http-fcgi: switch to json logging format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/707257 (https://phabricator.wikimedia.org/T285384) (owner: 10Giuseppe Lavagetto) [14:36:46] <_joe_> !log rebuilding httpd-fcgi, mediawiki-http fixing logging T285384 [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:58] T285384: The mediawiki-webserver image should only log in json format - 
https://phabricator.wikimedia.org/T285384 [14:37:16] (03PS1) 10Ottomata: Revert "Revert "kafka - Use hardened_tls instead of java::security" [puppet] - 10https://gerrit.wikimedia.org/r/707025 [14:37:25] (03PS2) 10Ottomata: Revert "Revert "kafka - Use hardened_tls instead of java::security" [puppet] - 10https://gerrit.wikimedia.org/r/707025 [14:37:56] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "kafka - Use hardened_tls instead of java::security" [puppet] - 10https://gerrit.wikimedia.org/r/707025 (owner: 10Ottomata) [14:38:53] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30315/console" [puppet] - 10https://gerrit.wikimedia.org/r/707025 (owner: 10Ottomata) [14:40:59] (03PS3) 10Ottomata: Revert "Revert "kafka - Use hardened_tls instead of java::security" [puppet] - 10https://gerrit.wikimedia.org/r/707025 [14:42:28] (03CR) 10BryanDavis: [C: 03+1] wikireplica_dns: fix s7 web aliases [puppet] - 10https://gerrit.wikimedia.org/r/707274 (owner: 10Majavah) [14:44:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [14:50:31] thanks _joe_! [14:57:18] RECOVERY - Check systemd state on wtp1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:54] 10SRE, 10SRE-Access-Requests: Issues with server access, assistance requested - https://phabricator.wikimedia.org/T287245 (10RLazarus) p:05Triage→03Medium a:03RLazarus Claiming this as the SRE on clinic duty -- thanks all for the suggestions. I chatted with @Kbrown and we agreed troubleshooting this in... 
[15:10:04] (PS1) Jbond: debian: drop metadata.json file [puppet] - https://gerrit.wikimedia.org/r/707395
[15:11:32] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30316/console" [puppet] - https://gerrit.wikimedia.org/r/707395 (owner: Jbond)
[15:11:42] !log stop ml-serve-ctrl1001 + gnt-instance modify -t plain ml-serve-ctrl1001.eqiad.wmnet on ganeti1009 + start instance back - T287238
[15:11:51] (CR) Jbond: [V: +1 C: +2] debian: drop metadata.json file [puppet] - https://gerrit.wikimedia.org/r/707395 (owner: Jbond)
[15:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:52] T287238: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238
[15:16:24] (CR) Jbond: "> Patch Set 5: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: Jbond)
[15:16:37] (CR) Bstorm: [C: +2] wikireplica_dns: fix s7 web aliases [puppet] - https://gerrit.wikimedia.org/r/707274 (owner: Majavah)
[15:19:35] (CR) Cwhite: global: add a simple requires.txt (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/707256 (owner: David Caro)
[15:21:57] (PS6) Jbond: profile: create in module data for profile [puppet] - https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539)
[15:22:46] (CR) jerkins-bot: [V: -1] profile: create in module data for profile [puppet] - https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539) (owner: Jbond)
[15:25:13] (CR) Elukey: [C: +2] Add support for knative serving [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[15:25:47] (CR) Elukey: [V: +2 C: +2] Add support for knative serving [deployment-charts] -
https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[15:26:35] (CR) Ahmon Dancy: "Looks reasonable. Simpler. Do we know that `gerrit init` won't re-add the listenAddress clause after this?" [puppet] - https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[15:34:40] (PS1) Jbond: debian: fix typos [puppet] - https://gerrit.wikimedia.org/r/707401
[15:35:55] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/707401 (owner: Jbond)
[15:37:30] (PS7) Jbond: profile: create in module data for profile [puppet] - https://gerrit.wikimedia.org/r/704333 (https://phabricator.wikimedia.org/T285539)
[15:40:08] (CR) Jbond: [C: +2] debian: fix typos [puppet] - https://gerrit.wikimedia.org/r/707401 (owner: Jbond)
[15:44:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[15:45:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[15:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:16] !log powerdown wdqs2002 for IDRAC reset
[15:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:23] (CR) Hashar: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[15:49:33] PROBLEM - Host wdqs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:49] RECOVERY - Host wdqs2002 is UP: PING OK - Packet loss = 0%, RTA = 30.16 ms
[16:01:02] (PS1) Elukey: admin_ng: add knative-serving in bases list [deployment-charts] - https://gerrit.wikimedia.org/r/707408 (https://phabricator.wikimedia.org/T278194)
[16:01:28] _joe_ a quick one if you have a minute --^
[16:03:45] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:05:27] PROBLEM - puppet last run on kubernetes1011 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:05:35] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:05:35] PROBLEM - puppet last run on kubernetes1008 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:05:53] PROBLEM - puppet last run on kubernetes1014 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:05:59] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:05:59] PROBLEM -
puppet last run on kubernetes1007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:06:23] checking --^
[16:07:02] ah this seems to be related to dragonfly-testin
[16:07:21] PROBLEM - puppet last run on kubernetes1013 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:07:33] PROBLEM - puppet last run on kubernetes1010 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:07:51] PROBLEM - puppet last run on kubestage1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:09:43] jayme: o/ any chance that you just re-enabled puppet on those? --^
[16:09:45] (CR) Bstorm: [C: +2] metricsinfra: remove alertmanager from prometheus role [puppet] - https://gerrit.wikimedia.org/r/705632 (https://phabricator.wikimedia.org/T286335) (owner: Majavah)
[16:09:47] puppet runs fine
[16:10:11] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:10:13] PROBLEM - puppet last run on kubernetes1012 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:11:04] running puppet on kubernetes1* anyway to clear all these alerts
[16:11:43] RECOVERY - puppet last run on kubernetes1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:11:51] RECOVERY - puppet last run on kubernetes1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:11:51] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled,
last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:12:50] ah yes I see jayme in cumin1001's logs :)
[16:13:01] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:39] RECOVERY - puppet last run on kubestage1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:14:08] XioNoX: still waiting on PIC
[16:14:32] (CR) Bstorm: [C: -1] "Since I mostly rewrote the script yesterday, this cannot merge. I can try to add what you were aiming for. Sorry about that." [puppet] - https://gerrit.wikimedia.org/r/701515 (https://phabricator.wikimedia.org/T285537) (owner: David Caro)
[16:14:54] * elukey waves to papaul
[16:15:07] elukey: hello
[16:15:13] SRE, ops-codfw, DC-Ops, Discovery-Search (Current work): hw troubleshooting: SSH failure for wdqs2002.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T287112 (Papaul) Open→Resolved Reset IDRAC, server is back up
[16:15:45] !log enable puppet on mc-gp* hosts
[16:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:01] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:16:01] RECOVERY - puppet last run on kubernetes1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:17:08] RECOVERY - puppet last run on kubernetes1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:17:15] RECOVERY - puppet last run on kubernetes1008 is OK: OK: Puppet is currently
enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:17:15] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:18:20] (CR) Bstorm: [C: -1] "Actually, we should probably wait to see if the openstacksdk is more reliable at session-passing than novaclient. Either way, I'll add you" [puppet] - https://gerrit.wikimedia.org/r/701515 (https://phabricator.wikimedia.org/T285537) (owner: David Caro)
[16:19:01] RECOVERY - puppet last run on kubernetes1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:11] RECOVERY - puppet last run on kubernetes1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:19:35] PROBLEM - puppet last run on mc-gp1002 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:24:35] RECOVERY - puppet last run on mc-gp1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:29:04] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (RLazarus) p:Triage→Medium
[16:31:46] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Christina Macholan - https://phabricator.wikimedia.org/T287233 (RLazarus) Hi Christina, welcome to the Foundation! @Aklapper is correct about the format we like to use for these requests -- I've edited the task for you, so all I need the inform...
[16:33:10] SRE, LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (RLazarus) p:Triage→Medium
[16:36:20] SRE, Wikimedia-Mailing-lists, Chinese-Sites: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (RLazarus) p:Triage→Medium
[16:43:01] Papaul: let me know when the PIC arrives. I'm gonna try to bring it online for the practice, and will reach out to XioNoX if any issues/unsure of anything.
[16:43:55] SRE, ops-eqiad, DC-Ops: Relabel dbstore1004 to db1183 - https://phabricator.wikimedia.org/T286468 (wiki_willy) a:wiki_willy→Jclark-ctr
[16:49:02] (CR) Ahmon Dancy: [C: +1] gerrit: listen on all address with iptables rule [puppet] - https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: Hashar)
[17:06:15] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm)
[17:07:07] SRE, Traffic: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (ssingh) p:Triage→Low
[17:45:43] (PS1) Legoktm: sre.switchdc.mediawiki: Run the warmup cache script at least 6 times [cookbooks] - https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802)
[17:46:43] (PS2) Legoktm: sre.switchdc.mediawiki: Run the warmup cache script at least 6 times [cookbooks] - https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802)
[17:49:40] (PS1) Legoktm: Bump shellbox to 2021-07-23-172126-score [deployment-charts] - https://gerrit.wikimedia.org/r/707458 (https://phabricator.wikimedia.org/T287212)
[17:57:59] (PS2) Legoktm: Bump shellbox to 2021-07-23-172126-score [deployment-charts] - https://gerrit.wikimedia.org/r/707458 (https://phabricator.wikimedia.org/T287212)
[18:04:36] (CR) Legoktm: [C: +2] Bump shellbox to 2021-07-23-172126-score [deployment-charts] - https://gerrit.wikimedia.org/r/707458 (https://phabricator.wikimedia.org/T287212) (owner: Legoktm)
[18:04:43] (PS1) Bstorm: cloud dns: tidy up the labs-ip-alias-dump script [puppet] - https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537)
[18:05:38] (CR) Bstorm: "I tested that this doesn't introduce new behavior or error out in my home dir on cloudservices1003" [puppet] - https://gerrit.wikimedia.org/r/707478 (https://phabricator.wikimedia.org/T285537) (owner: Bstorm)
[18:06:47] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org
[18:07:20] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/707484
[18:07:22] (Merged) jenkins-bot: Bump shellbox to 2021-07-23-172126-score [deployment-charts] - https://gerrit.wikimedia.org/r/707458 (https://phabricator.wikimedia.org/T287212) (owner: Legoktm)
[18:12:56] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[18:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:16] !log Turning up et-0/0/[0-1] and et-0/2/[0-1] interfaces on cr2-codfw after line card replacement slot 0.
[18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:03] RECOVERY - OSPF status on mr1-codfw is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:15:36] (PS1) Ottomata: airflow - set default smtp settings [puppet] - https://gerrit.wikimedia.org/r/707489 (https://phabricator.wikimedia.org/T285692)
[18:16:05] (CR) jerkins-bot: [V: -1] airflow - set default smtp settings [puppet] - https://gerrit.wikimedia.org/r/707489 (https://phabricator.wikimedia.org/T285692) (owner: Ottomata)
[18:17:57] (PS2) Ottomata: airflow - set default smtp settings [puppet] - https://gerrit.wikimedia.org/r/707489 (https://phabricator.wikimedia.org/T285692)
[18:18:49] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30318/console" [puppet] - https://gerrit.wikimedia.org/r/707489 (https://phabricator.wikimedia.org/T285692) (owner: Ottomata)
[18:21:11] (PS3) Ottomata: airflow - set default smtp settings [puppet] - https://gerrit.wikimedia.org/r/707489 (https://phabricator.wikimedia.org/T285692)
[18:22:09] (CR) Ottomata: [C: +2] airflow - set default smtp settings [puppet] - https://gerrit.wikimedia.org/r/707489 (https://phabricator.wikimedia.org/T285692) (owner: Ottomata)
[18:24:35] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[18:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:43] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:26:57] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[18:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:22] SRE, Infrastructure-Foundations, netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (cmooney) @Papaul replaced card and interfaces have been switched up. All seems ok. ` cmooney@re0.cr2-codfw> show chassis fpc pic-status 0 Slot 0 Online MPCE Type 3...
[18:52:04] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm)
[18:53:08] (PS1) Cathal Mooney: Revert "Re-depool eqiad" [dns] - https://gerrit.wikimedia.org/r/707427
[18:53:47] (CR) RLazarus: [C: +1] Revert "Re-depool eqiad" [dns] - https://gerrit.wikimedia.org/r/707427 (owner: Cathal Mooney)
[18:55:19] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm) OK, we're now running lilypond 2.22.0 which should make some more things available in safe mode. Nu...
[18:58:18] (PS2) Cathal Mooney: Revert "Re-depool eqiad" [dns] - https://gerrit.wikimedia.org/r/707427
[18:59:31] (CR) Cathal Mooney: [C: +2] Revert "Re-depool eqiad" [dns] - https://gerrit.wikimedia.org/r/707427 (owner: Cathal Mooney)
[19:02:09] !log De-pooling eqiad again after successful replacement of linecard in cr2-codfw T287110
[19:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:18] T287110: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110
[19:03:40] topranks: should that be "repooling"?
[19:04:02] It should yes.
[19:04:24] Sorry.... the text in the gerrit commit msg is confusing me with the double negative.
[19:04:32] Will I log another message to clarify...
my bad sry
[19:04:45] re-un-de-non-antipooling eqiad :D
[19:04:53] yeah, can either do that or just edit it directly on wikitech
[19:04:55] makes sense to me :D
[19:05:05] let me do the latter seems less confusing.
[19:05:59] I would suggest re !log-ing since it also goes to other places like sal.toolforge.org, Twitter, Mastodon, ...
[19:06:29] yeah, was about to say, there are a couple of places that won't get updated, but officially the SAL on wikitech is canonical
[19:06:43] legoktm: ok thanks. I've already edited on wikitech.
[19:06:46] if it were something really important I'd say definitely log again, in this case I think it's fine either way 🤷
[19:06:54] also fine to do both just to be safe, certainly
[19:06:59] what's best? log another msg? undo the edit back, and then log another msg?
[19:07:25] authdns-update running now btw
[19:08:35] definitely keep the edit IMO, feel free to also log again to clarify - if lego thinks you should, that's a good enough reason for me
[19:09:09] ok yeah probably best.
[19:09:13] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 66 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:09:47] ^ man, you network folks never get to deal with one thing at a time, do you
[19:11:31] !log Successfully re-pooled eqiad - reversed change from yesterday after successful line card replacement in cr2-codfw - T287110
[19:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:37] T287110: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110
[19:11:46] rzl: oh man
[19:12:27] (eqiad Varnish traffic is starting to creep up, as expected 👍)
[19:13:29] cool yep see those graphs :)
[19:15:09] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 42 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:16:31] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 53.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:17:27] ^ expected
[19:17:54] yep cool... those graphs are a little hard to wrap your head around but I'm getting used to them.
[19:18:36] yeah, the log-scale percentage graph linked from that alert never really does much for me
[19:18:54] but the stacked graph at https://grafana.wikimedia.org/d/000000093/varnish-traffic?orgId=1&from=now-30m&to=now is great for verifying that the sum hasn't changed
[19:19:45] ah ok that's nice. Yeah I'm a fan of stacked graphs like that for validating that upswing_here == downswing_there. I'll bookmark that.
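Editor's note: the check described in the last few messages (eqiad's share rising, codfw's falling, the sum across sites staying flat) can be written down as a small comparison. This is only a sketch. The site names come from the log, but the request rates, the 5% `tolerance`, and the function names `traffic_conserved` and `per_site_delta` are hypothetical, chosen purely for illustration, and are not taken from any Wikimedia dashboard or tool.

```python
def traffic_conserved(before, after, tolerance=0.05):
    """Return True if total traffic across all sites moved by at most `tolerance`.

    `before` and `after` map site name -> request rate (any consistent unit).
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0:
        return total_after == 0
    return abs(total_after - total_before) / total_before <= tolerance


def per_site_delta(before, after):
    """Per-site change, so an upswing at one site can be matched to a downswing elsewhere."""
    sites = sorted(set(before) | set(after))
    return {site: after.get(site, 0) - before.get(site, 0) for site in sites}


# Hypothetical requests/sec, before and after re-pooling eqiad.
before = {"eqiad": 0, "codfw": 60000, "ulsfo": 15000}
after = {"eqiad": 28000, "codfw": 33000, "ulsfo": 14500}

print(traffic_conserved(before, after))
print(per_site_delta(before, after))
```

A stacked graph answers the same question visually: the top edge of the stack is the total, so a flat top edge with shifting bands means traffic moved between sites rather than disappearing.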
[19:20:57] yeah, https://grafana.wikimedia.org/d/000000180/varnish-http-requests also has stacked graphs in requests/min rather than bytes/sec
[19:22:09] RIPE stats to ulsfo back to usual as well.
[19:22:16] * topranks exhales
[19:22:37] (all of this for varnish-fe rather than ats-tls just because I happen to know where the graphs are, so technically we're looking at the middle of the stack, traffic folks might have different dashboards they habitually check - but in practice for this purpose I think it's fine)
[19:23:13] yeah, looks good! and we're past 10 minutes, so the well-behaved DNS caches are caught up, just the long tail now
[19:23:14] two sides of the same coin I guess, unless they are down and we need to know which layer is doing what?
[19:23:17] we should be all set :)
[19:23:42] great
[19:23:51] yeah, exactly - there are failure modes where ats-tls is seeing traffic that varnish-fe isn't, but very few correct cases where that happens
[19:24:02] makes sense
[19:24:10] and definitely not cases where varnish-fe is seeing traffic that ats-tls isn't
[19:24:27] yep
[19:25:51] thanks for doing that, nice to have it back online for the weekend
[19:26:54] yeah great. thanks a bunch for your help, I would have struggled otherwise, certainly been very worried I was doing the wrong thing!
[19:27:09] well, as a resident of the southeast US it's like a 5 ms faster RTT for me
[19:27:13] so I was definitely serving my own interests
[19:30:01] haha... well in that case very happy to oblige :)
[19:34:36] SRE, Infrastructure-Foundations, netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (cmooney) Everything still looking good, eqiad re-pooled and combined stats across sites as they were but eqiad back in the pool. Resolving task.
[19:34:58] SRE, Infrastructure-Foundations, netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (cmooney) Open→Resolved
[19:40:01] (CR) Ottomata: [C: +1] Update TLS configuration for analytics-test-presto (1 comment) [puppet] - https://gerrit.wikimedia.org/r/706661 (https://phabricator.wikimedia.org/T273642) (owner: Btullis)
[19:45:21] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:03:26] (PS1) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:04:01] (CR) jerkins-bot: [V: -1] Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: Ottomata)
[20:06:12] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (HirnSpuk) Hi everyone, I had an eye on here for a contributor in german wikibooks. I kept him informed about...
[20:12:22] (PS2) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:12:53] (PS3) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:12:55] (CR) jerkins-bot: [V: -1] Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: Ottomata)
[20:13:23] (CR) jerkins-bot: [V: -1] Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: Ottomata)
[20:13:33] (PS4) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:15:34] (PS5) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:20:44] (PS6) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:26:43] (PS7) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:38:52] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Legoktm) >>! In T257066#7233684, @HirnSpuk wrote: > Hi everyone, I had an eye on here for a contributor in g...
[20:39:39] (PS1) Bstorm: toolforge harbor: puppetize experimental base server for harbor [puppet] - https://gerrit.wikimedia.org/r/707572 (https://phabricator.wikimedia.org/T267616)
[20:41:34] (CR) Bstorm: [C: +2] toolforge harbor: puppetize experimental base server for harbor [puppet] - https://gerrit.wikimedia.org/r/707572 (https://phabricator.wikimedia.org/T267616) (owner: Bstorm)
[20:54:23] (PS8) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:54:54] (CR) jerkins-bot: [V: -1] Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: Ottomata)
[20:55:46] (PS9) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[20:55:49] I just logged into mwmaint1002 and all my files are gone. There are year-old copies on mwmaint2002, but I had a lot of new stuff on mwmaint1002. Is there any way to get those files back?
[20:56:12] (Or maybe everything got copied somewhere else?)
[21:01:18] (PS10) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[21:02:05] Trey314159: yeah, mwmaint was reimaged but homedirs were backed up on Bacula, so we should be able to pull your stuff out of there
[21:02:26] rzl: cool!
[21:03:31] Should I open a ticket in phab, or is it quick'n'easy?
[21:04:07] checking :) it's definitely quick and easy for the right person, but that person isn't me
[21:04:25] so let me see if it's *also* quick and easy for me, if not I'll have to point you at phab
[21:04:54] thanks
[21:06:32] (PS11) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[21:08:28] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30330/console" [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: Ottomata)
[21:09:39] Trey314159: did you want everything under ~tjones, or a subset?
[21:10:20] I think there's just one directory, "reindex". I'd like that back
[21:10:25] 👍
[21:10:34] this'll be a snapshot as of July 13, just before it was reimaged, if that suits you
[21:12:11] okay, restoring, stand by 🤞
[21:12:17] sounds great!
[21:12:23] (PS12) Ottomata: Deprecate profile::analytics::cluster::users [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063)
[21:15:42] (CR) Ottomata: "PCC looks about right to me. A few questions and comments inline."
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/707564 (https://phabricator.wikimedia.org/T287063) (owner: Ottomata)
[21:15:52] Trey314159: ah okay, so: the backup is 100% intact, nothing to worry about, but restoring it is a little trickier in this case, since the host was reimaged -- normally the backups are decrypted with a host key, but since that key is gone, we'll decrypt using the master key instead
[21:16:07] Trey314159: that process is a little more involved and I'd rather not do it on a Friday afternoon when our (EU timezone) expert is already offline for the weekend
[21:16:41] so, let's fail over to a phab task after all, and we should be able to pull your data out first thing next week -- sorry for the false hope :) will that timeline be okay for you?
[21:16:53] rzl: that's fair. I think I can cobble together a ticket from the info here.
[21:17:03] next week will do. Thanks for all the help!
[21:19:04] Trey314159: if you tag it with #Data-Persistence-Backup it'll reach the right people -- just mention the hostname, path to restore, and if you want to make their lives easier you can mention it was reimaged at 2021-07-13 11:13
[21:19:40] Will do!
[21:21:17] I think we should make sure to rsync home dirs from maint hosts before the switchover
[21:23:07] mm yeah, we did that the last time we reimaged them -- I think m.utante's theory was we didn't need to do that this time since we have bacula now, but in practice it seems like that's still fairly disruptive
[21:23:32] but that's before a reimage -- I'm not sure if we need to do it before each switchover
[21:24:27] I think it would make the host switch less disruptive
[21:24:40] I guess the idea is if everyone is hopping over to codfw, all their stuff can be there waiting for them -- yeah
[21:24:42] writing up some more coherent reasoning in phab
[21:24:43] I could see that
[21:25:42] legoktm: that'd be great! Thanks!
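Editor's note: a rough sketch of the rsync step proposed above, for readers unfamiliar with it. The helper only builds the command line for copying one user's home directory to the other maintenance host, so the command can be reviewed before anything is run. The short host name and the `tjones` user come from the log; the `/home/<user>/` layout, pushing over SSH from the source host, the chosen flags, and the helper name `build_home_rsync` are all assumptions, not the procedure actually used for switchovers.

```python
import shlex


def build_home_rsync(user, dest_host, dry_run=True):
    """Build an rsync argv that copies /home/<user>/ to the same path on dest_host.

    Assumed layout: home directories live under /home on both hosts.
    --archive preserves permissions, ownership and timestamps.
    --update skips files that are already newer on the destination.
    --dry-run (on by default) only reports what would be transferred.
    """
    src = f"/home/{user}/"
    dest = f"{dest_host}:/home/{user}/"
    cmd = ["rsync", "--archive", "--update"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, dest]
    return cmd


# Print the command for review instead of executing it.
cmd = build_home_rsync("tjones", "mwmaint2002")
print(shlex.join(cmd))
```

Defaulting to a dry run is a deliberate choice here: the conversation is about avoiding surprises for people's files, so the sketch errs toward showing what would change first.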
[21:28:32] SRE, Datacenter-Switchover: Add step to rsync home dirs on mwmaint hosts before DC switchover - https://phabricator.wikimedia.org/T287303 (Legoktm)
[21:28:39] rzl: ^
[21:29:11] 👍
[21:36:06] Trey314159: seen, thanks -- on the off chance you don't hear back as quickly as I expect, please do ping me and I'll follow up
[21:36:13] sorry for the inconvenience here
[21:37:17] Thanks, rzl—I knew Friday afternoon was not the best time to be asking for anything, so I do appreciate the help!
[21:38:22] I would also add that it's very trivial to puppetize extra scripts you need on the mwmaint hosts
[21:42:05] legoktm: Appreciated, but it's trivial for you, not for me. I have abstract knowledge of puppet, but no practical knowledge. I also have some data from recent reindexing runs that's useful for debugging when things don't work as expected, etc. I just have to pay more attention to switchovers (and root for your new ticket to get implemented!)
[21:42:41] that's fair
[21:44:42] if you drop something in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/files/home/ it'll be present in your home directory on *all* hosts you have access to
[21:45:00] and then https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/mediawiki/files/maintenance/ is a place to put stuff specifically for mwmaint
[21:45:34] that doesn't take care of the data though, which the rsync would handle
[21:47:33] SRE, Data-Persistence, Data-Persistence-Backup: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (RLazarus) My naive attempt at https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode) went fine until the decryption phase, at which point "Error...
[21:58:59] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 80 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:04:55] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 40 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:20:54] SRE, Data-Persistence, Data-Persistence-Backup: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (RLazarus) (Oh, and the timestamp came from T267607#7208278.)
[23:20:19] (PS12) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803)
[23:26:58] (CR) jerkins-bot: [V: -1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[23:31:11] (PS13) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803)
[23:41:05] (CR) RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)