2023-03-27 02:06:45
|
<jinxer-wm>
|
(JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
2023-03-27 02:09:28
|
<wikibugs>
|
'ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (''Andrew) @Jclark-ctr it'll be another week or two before we have workloads moved off of this.'
|
2023-03-27 02:26:45
|
<jinxer-wm>
|
(JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
2023-03-27 02:29:39
|
<jinxer-wm>
|
(NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
|
2023-03-27 05:10:09
|
<wikibugs>
|
('PS2) ''KartikMistry: Update cxserver to 2023-03-17-133444-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379)'
|
2023-03-27 05:13:01
|
<wikibugs>
|
('PS3) ''Marostegui: mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)'
|
2023-03-27 05:14:21
|
<logmsgbot>
|
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
|
2023-03-27 05:14:27
|
<stashbot>
|
T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
|
2023-03-27 05:14:37
|
<logmsgbot>
|
!log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
|
2023-03-27 05:16:56
|
<kart_>
|
Updating cxserver, minor changes.
|
2023-03-27 05:18:10
|
<wikibugs>
|
('CR) ''KartikMistry: [C: ''+2] Update cxserver to 2023-03-17-133444-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: ''KartikMistry)'
|
2023-03-27 05:19:04
|
<wikibugs>
|
('PS1) ''Marostegui: db1179: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292)'
|
2023-03-27 05:19:35
|
<wikibugs>
|
('PS15) ''KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - ''https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)'
|
2023-03-27 05:19:41
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T332292', diff saved to https://phabricator.wikimedia.org/P45942 and previous config saved to /var/cache/conftool/dbconfig/20230327-051941-root.json
|
2023-03-27 05:19:46
|
<stashbot>
|
T332292: Move db1179 to x1 - https://phabricator.wikimedia.org/T332292
|
2023-03-27 05:19:53
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] db1179: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292) (owner: ''Marostegui)'
|
2023-03-27 05:22:56
|
<wikibugs>
|
('Merged) ''jenkins-bot: Update cxserver to 2023-03-17-133444-production [deployment-charts] - ''https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: ''KartikMistry)'
|
2023-03-27 05:23:42
|
<wikibugs>
|
('PS1) ''Marostegui: mariadb: Move db1179 to x1 [puppet] - ''https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292)'
|
2023-03-27 05:23:47
|
<logmsgbot>
|
!log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
|
2023-03-27 05:24:14
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] mariadb: Move db1179 to x1 [puppet] - ''https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292) (owner: ''Marostegui)'
|
2023-03-27 05:24:27
|
<logmsgbot>
|
!log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
|
2023-03-27 05:28:00
|
<logmsgbot>
|
!log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
|
2023-03-27 05:28:52
|
<logmsgbot>
|
!log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
|
2023-03-27 05:37:57
|
<logmsgbot>
|
!log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
|
2023-03-27 05:38:42
|
<logmsgbot>
|
!log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
|
2023-03-27 05:40:49
|
<kart_>
|
!log Updated cxserver to 2023-03-17-133444-production (T332379 + build changes)
|
2023-03-27 05:40:53
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 05:40:54
|
<stashbot>
|
T332379: Post-creation work for anpwiki - https://phabricator.wikimedia.org/T332379
|
2023-03-27 05:57:34
|
<wikibugs>
|
('PS1) ''KartikMistry: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834)'
|
2023-03-27 06:19:47
|
<wikibugs>
|
('CR) ''Krinkle: Fix PHP string interpolation (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: ''Reedy)'
|
2023-03-27 06:29:39
|
<jinxer-wm>
|
(NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
|
2023-03-27 06:36:42
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45944 and previous config saved to /var/cache/conftool/dbconfig/20230327-063642-root.json
|
2023-03-27 06:40:20
|
<marostegui>
|
!log Rename flaggedrevs tables on db1123 ptwikisource T332594
|
2023-03-27 06:40:24
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 06:40:25
|
<stashbot>
|
T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
|
2023-03-27 06:51:47
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45945 and previous config saved to /var/cache/conftool/dbconfig/20230327-065147-root.json
|
2023-03-27 06:51:53
|
<marostegui>
|
!log dbmaint s3 eqiad Rename flaggedrevs tables on db1123 ptwikisource T332594
|
2023-03-27 06:51:57
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 06:51:58
|
<stashbot>
|
T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
|
2023-03-27 06:54:22
|
<wikibugs>
|
('PS1) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
|
2023-03-27 07:00:05
|
<jouncebot>
|
Amir1, Urbanecm, and taavi: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T0700).
|
2023-03-27 07:00:05
|
<jouncebot>
|
No Gerrit patches in the queue for this window AFAICS.
|
2023-03-27 07:06:49
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 07:06:52
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45946 and previous config saved to /var/cache/conftool/dbconfig/20230327-070651-root.json
|
2023-03-27 07:07:57
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 07:09:15
|
<wikibugs>
|
('PS1) ''Marostegui: backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)'
|
2023-03-27 07:12:08
|
<marostegui>
|
jynus: also that one ^ :)
|
2023-03-27 07:12:19
|
<jynus>
|
oh
|
2023-03-27 07:12:29
|
<jynus>
|
I forgot
|
2023-03-27 07:12:46
|
<jynus>
|
needs 2 changes actually
|
2023-03-27 07:12:59
|
<marostegui>
|
ah yes
|
2023-03-27 07:13:00
|
<marostegui>
|
I see it
|
2023-03-27 07:13:02
|
<marostegui>
|
let me fix it
|
2023-03-27 07:13:27
|
<wikibugs>
|
('PS2) ''Marostegui: backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)'
|
2023-03-27 07:13:29
|
<marostegui>
|
jynus: ^
|
2023-03-27 07:13:41
|
<wikibugs>
|
('PS4) ''Marostegui: mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)'
|
2023-03-27 07:13:47
|
<wikibugs>
|
('CR) ''Jcrespo: [C: ''+1] backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
|
2023-03-27 07:14:06
|
<jynus>
|
one sec because I was looking and there are backups still running
|
2023-03-27 07:14:27
|
<marostegui>
|
sure no problem
|
2023-03-27 07:21:57
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45947 and previous config saved to /var/cache/conftool/dbconfig/20230327-072156-root.json
|
2023-03-27 07:30:28
|
<wikibugs>
|
('CR) ''Jcrespo: [C: ''+1] mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
|
2023-03-27 07:32:29
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] mariadb: Promote db1101 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
|
2023-03-27 07:33:11
|
<wikibugs>
|
'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''Joe)'
|
2023-03-27 07:34:11
|
<urbanecm>
|
goes to do some MW deployment, since B&C is empty
|
2023-03-27 07:34:16
|
<wikibugs>
|
('CR) ''Urbanecm: [C: ''+2] SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: ''Urbanecm)'
|
2023-03-27 07:34:32
|
<wikibugs>
|
('CR) ''Urbanecm: [C: ''+2] GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: ''Urbanecm)'
|
2023-03-27 07:36:40
|
<wikibugs>
|
('Merged) ''jenkins-bot: SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: ''Urbanecm)'
|
2023-03-27 07:37:01
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45948 and previous config saved to /var/cache/conftool/dbconfig/20230327-073701-root.json
|
2023-03-27 07:38:50
|
<logmsgbot>
|
!log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]]
|
2023-03-27 07:38:58
|
<stashbot>
|
T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
|
2023-03-27 07:39:50
|
<jynus>
|
!log disabling puppet and shutting down bacula at backup1001 T331510
|
2023-03-27 07:39:55
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 07:39:55
|
<stashbot>
|
T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
|
2023-03-27 07:41:52
|
<jynus>
|
a prometheus availability job will alert because of the above log, as the job only monitors that 1 host
|
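jynus's remark above can be illustrated with a small sketch: if a job's availability is taken as the fraction of its scrape targets that are up, a job with a single target falls straight to 0% when that one host is stopped. The function name and the 50% threshold are illustrative assumptions, not the actual JobUnavailable rule.

```python
def job_availability(targets_up: list[bool]) -> float:
    """Fraction of a job's scrape targets that are currently up."""
    if not targets_up:
        raise ValueError("job has no targets")
    return sum(targets_up) / len(targets_up)


# Hypothetical threshold; the real alert rule may use a different value.
ALERT_BELOW = 0.5

single_target_job = [False]             # e.g. bacula stopped on its only host
many_target_job = [True] * 9 + [False]  # one of ten targets down

for name, targets in [("single", single_target_job), ("many", many_target_job)]:
    availability = job_availability(targets)
    print(name, availability, "ALERT" if availability < ALERT_BELOW else "ok")
```

With only one target there is no partial state: the job is either fully available or fully unavailable, so planned maintenance on that host is expected to trip the alert.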
2023-03-27 07:44:25
|
<wikibugs>
|
('PS1) ''Jcrespo: bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - ''https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896)'
|
2023-03-27 07:46:45
|
<jinxer-wm>
|
(JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
2023-03-27 07:48:21
|
<logmsgbot>
|
!log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
|
2023-03-27 07:48:26
|
<stashbot>
|
T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
|
2023-03-27 07:48:39
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''fundraising-tech-ops, ''netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (''ayounsi) ''Open→''Resolved a:''ayounsi Done!'
|
2023-03-27 07:51:39
|
<wikibugs>
|
('CR) ''Jcrespo: [C: ''+2] bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - ''https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: ''Jcrespo)'
|
2023-03-27 07:52:06
|
<logmsgbot>
|
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45949 and previous config saved to /var/cache/conftool/dbconfig/20230327-075206-root.json
|
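The db1120 repooling logged above steps through 5%, 10%, 25%, 50%, 75% and 100%, roughly 15 minutes apart (06:36:42 to 07:52:06). A minimal sketch of such a staged schedule follows; `repool_schedule` is a hypothetical helper, not part of dbctl, and the fixed 15-minute interval only approximates the logged timestamps.

```python
from datetime import datetime, timedelta


def repool_schedule(start, percentages, interval):
    """Return (time, percentage) pairs for a staged repool."""
    return [(start + i * interval, pct) for i, pct in enumerate(percentages)]


steps = repool_schedule(
    datetime(2023, 3, 27, 6, 36, 42),  # first repool commit in the log
    [5, 10, 25, 50, 75, 100],          # percentages seen in the log
    timedelta(minutes=15),             # assumed fixed interval
)

for when, pct in steps:
    print(f"{when:%H:%M:%S} pool db1120 at {pct}%")
```

Raising the pooled percentage in stages lets the host take traffic gradually instead of all at once.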
2023-03-27 07:52:55
|
<wikibugs>
|
('Merged) ''jenkins-bot: GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: ''Urbanecm)'
|
2023-03-27 07:55:13
|
<icinga-wm>
|
RECOVERY - PHP7 rendering on parse2017 is OK: HTTP OK: HTTP/1.1 302 Found - 519 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
|
2023-03-27 07:55:36
|
<logmsgbot>
|
!log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] (duration: 16m 45s)
|
2023-03-27 07:55:41
|
<stashbot>
|
T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
|
2023-03-27 07:58:36
|
<logmsgbot>
|
!log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]]
|
2023-03-27 07:58:41
|
<stashbot>
|
T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
|
2023-03-27 07:59:58
|
<logmsgbot>
|
!log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
|
2023-03-27 08:00:57
|
<icinga-wm>
|
RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 08:01:12
|
<wikibugs>
|
('CR) ''Tacsipacsi: [huwiki] Add Draft and Draft_talk namespaces (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
|
2023-03-27 08:02:04
|
<wikibugs>
|
('PS1) ''Ladsgroup: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941)'
|
2023-03-27 08:02:27
|
<wikibugs>
|
('CR) ''Ladsgroup: [C: ''+2] EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: ''Ladsgroup)'
|
2023-03-27 08:02:53
|
<wikibugs>
|
('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40331/console" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
|
2023-03-27 08:03:43
|
<marostegui>
|
!log Failover m1 from db1164 to db1101 - T331510
|
2023-03-27 08:03:48
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 08:03:49
|
<stashbot>
|
T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
|
2023-03-27 08:03:54
|
<urbanecm>
|
Amir1: fyi my scap backport's just about to finish
|
2023-03-27 08:04:14
|
<Amir1>
|
mine takes twenty minutes to merge, don't worry
|
2023-03-27 08:04:14
|
<marostegui>
|
all done jynus
|
2023-03-27 08:04:21
|
<urbanecm>
|
ok
|
2023-03-27 08:04:28
|
<jynus>
|
ok to merge the backup patches?
|
2023-03-27 08:04:47
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''+2] backups: Replace db1164 with db1101 [puppet] - ''https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
|
2023-03-27 08:05:02
|
<marostegui>
|
Etherpad looks fie
|
2023-03-27 08:05:03
|
<marostegui>
|
fine
|
2023-03-27 08:05:35
|
<jynus>
|
it is a bit slow for me
|
2023-03-27 08:05:53
|
<marostegui>
|
I guess it's warming up
|
2023-03-27 08:06:04
|
<marostegui>
|
I can open the test pad fine
|
2023-03-27 08:06:29
|
<logmsgbot>
|
!log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] (duration: 07m 52s)
|
2023-03-27 08:06:34
|
<stashbot>
|
T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
|
2023-03-27 08:06:51
|
<jynus>
|
it is ok for me now
|
2023-03-27 08:06:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 08:07:11
|
<jynus>
|
what else to test?
|
2023-03-27 08:07:30
|
<marostegui>
|
jynus: librenms, which also works fine for me
|
2023-03-27 08:07:31
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Marostegui)'
|
2023-03-27 08:07:48
|
<jynus>
|
orch is complaining about lag, I guess not real?
|
2023-03-27 08:07:53
|
<marostegui>
|
reload :)
|
2023-03-27 08:08:18
|
<jynus>
|
still happening
|
2023-03-27 08:08:29
|
<urbanecm>
|
done
|
2023-03-27 08:08:31
|
<marostegui>
|
ah I know why
|
2023-03-27 08:09:14
|
<jynus>
|
cleanup of the table maybe?
|
2023-03-27 08:09:32
|
<wikibugs>
|
('PS1) ''Marostegui: db1101: Make it master [puppet] - ''https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510)'
|
2023-03-27 08:09:33
|
<marostegui>
|
jynus: nope, this ^
|
2023-03-27 08:09:37
|
<jynus>
|
I see
|
2023-03-27 08:09:44
|
<wikibugs>
|
('CR) ''Marostegui: [V: ''+2 C: ''+2] db1101: Make it master [puppet] - ''https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510) (owner: ''Marostegui)'
|
2023-03-27 08:10:40
|
<marostegui>
|
jynus: fixed!
|
2023-03-27 08:11:25
|
<jynus>
|
looking at the original path to see why I didn't see that
|
2023-03-27 08:11:29
|
<jynus>
|
*patch
|
2023-03-27 08:11:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 08:12:50
|
<jynus>
|
let me run puppet on backup hosts
|
2023-03-27 08:12:52
|
<wikibugs>
|
('CR) ''Jelto: [V: ''+1 C: ''+1] "lgtm, left one little question in-line" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
|
2023-03-27 08:12:53
|
<jynus>
|
to apply the change
|
2023-03-27 08:16:40
|
<wikibugs>
|
'SRE, ''DBA, ''Data-Engineering, ''Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (''Marostegui)'
|
2023-03-27 08:17:27
|
<wikibugs>
|
('CR) ''Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: ''Ladsgroup)'
|
2023-03-27 08:17:29
|
<wikibugs>
|
('PS1) ''Marostegui: Revert "backups: Replace db1164 with db1101" [puppet] - ''https://gerrit.wikimedia.org/r/903188'
|
2023-03-27 08:17:32
|
<urbanecm>
|
rolls out one more change
|
2023-03-27 08:17:35
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: ''Urbanecm)'
|
2023-03-27 08:17:39
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''-2] "Wait for the failover date" [puppet] - ''https://gerrit.wikimedia.org/r/903188 (owner: ''Marostegui)'
|
2023-03-27 08:17:59
|
<wikibugs>
|
('Merged) ''jenkins-bot: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: ''Ladsgroup)'
|
2023-03-27 08:18:18
|
<wikibugs>
|
('Merged) ''jenkins-bot: [Growth] eswiki: Enable mentorship for 50% of newcomers [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: ''Urbanecm)'
|
2023-03-27 08:18:25
|
<wikibugs>
|
('PS2) ''Filippo Giunchedi: prometheus1006: depool from alertmanager [puppet] - ''https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 08:18:33
|
<logmsgbot>
|
!log urbanecm@deploy2002 Backport cancelled.
|
2023-03-27 08:19:45
|
<wikibugs>
|
('PS1) ''Marostegui: mariadb: Promote db1164 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123)'
|
2023-03-27 08:19:58
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''-2] "Wait for the failover date" [puppet] - ''https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: ''Marostegui)'
|
2023-03-27 08:20:54
|
<wikibugs>
|
('CR) ''Jcrespo: [C: ''+1] Revert "backups: Replace db1164 with db1101" [puppet] - ''https://gerrit.wikimedia.org/r/903188 (owner: ''Marostegui)'
|
2023-03-27 08:20:59
|
<logmsgbot>
|
!log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]]
|
2023-03-27 08:21:01
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Marostegui)'
|
2023-03-27 08:21:05
|
<stashbot>
|
T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
|
2023-03-27 08:23:23
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: wmnet: move reads to graphite2004 [dns] - ''https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 08:24:59
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: graphite: check graphite2004 [puppet] - ''https://gerrit.wikimedia.org/r/903206 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 08:25:23
|
<wikibugs>
|
('CR) ''Marostegui: [C: ''-2] mariadb: Promote db1164 to m1 master [puppet] - ''https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: ''Marostegui)'
|
2023-03-27 08:25:47
|
<logmsgbot>
|
!log urbanecm@deploy2002 Synchronized wmf-config/InitialiseSettings.php: 63dd23b5ceaba35c8d9682493dd21d99a20fc8f7: [Growth] eswiki: Enable mentorship for 50% of newcomers (T332737, T285235) (duration: 06m 09s)
|
2023-03-27 08:25:54
|
<stashbot>
|
T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235
|
2023-03-27 08:25:54
|
<stashbot>
|
T332737: Increase percentage of newcomers who receive Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T332737
|
2023-03-27 08:26:40
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: statsd: move writes to graphite2004 [puppet] - ''https://gerrit.wikimedia.org/r/903207 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 08:26:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 08:28:10
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: wmnet: move writes to graphite2004 [dns] - ''https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 08:28:14
|
<jynus>
|
!log restarting bacula at backup1001 T331510
|
2023-03-27 08:28:19
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 08:28:20
|
<stashbot>
|
T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
|
2023-03-27 08:30:09
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (''Ladsgroup)'
|
2023-03-27 08:30:29
|
<logmsgbot>
|
!log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
|
2023-03-27 08:30:34
|
<stashbot>
|
T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
|
2023-03-27 08:31:25
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: Failover statsd to graphite2004 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 08:31:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 08:32:29
|
<wikibugs>
|
('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40332/console" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: ''Hashar)'
|
2023-03-27 08:32:31
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+1] "SGTM" [puppet] - ''https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: ''Elukey)'
|
2023-03-27 08:34:48
|
<icinga-wm>
|
RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
|
2023-03-27 08:35:43
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+1] k8s: Force to be explicit about k8s and calico versions [puppet] - ''https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 08:36:04
|
<wikibugs>
|
('CR) ''Elukey: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - ''https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 08:36:11
|
<wikibugs>
|
('PS2) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
|
2023-03-27 08:36:44
|
<wikibugs>
|
('CR) ''Jelto: [V: ''+1 C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: ''Hashar)'
|
2023-03-27 08:36:45
|
<jinxer-wm>
|
(JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
2023-03-27 08:38:42
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: ''Ayounsi)'
|
2023-03-27 08:39:15
|
<logmsgbot>
|
!log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] (duration: 18m 15s)
|
2023-03-27 08:39:22
|
<stashbot>
|
T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
|
2023-03-27 08:40:24
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+1] k8s: Remove 1.16 related code (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 08:40:53
|
<wikibugs>
|
('CR) ''Elukey: [C: ''+2] role::kafka::jumbo::broker: enable PKI migration settings [puppet] - ''https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: ''Elukey)'
|
2023-03-27 08:43:54
|
<wikibugs>
|
'SRE, ''DBA, ''Data-Engineering, ''Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (''fgiunchedi)'
|
2023-03-27 08:45:18
|
<wikibugs>
|
('PS2) ''Hashar: wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - ''https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068)'
|
2023-03-27 08:46:46
|
<wikibugs>
|
('CR) ''Clément Goubert: [C: ''+1] prometheus1006: depool from alertmanager [puppet] - ''https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: ''Filippo Giunchedi)'
|
2023-03-27 08:47:02
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
|
2023-03-27 08:50:14
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+2] prometheus1006: depool from alertmanager [puppet] - ''https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: ''Filippo Giunchedi)'
|
2023-03-27 08:51:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 08:52:14
|
<godog>
|
hah that was me, false alarm
|
2023-03-27 08:52:37
|
<godog>
|
prometheus1005 was also depooled, I've repooled it now
|
2023-03-27 08:53:39
|
<wikibugs>
|
('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40333/console" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
|
2023-03-27 08:55:02
|
<logmsgbot>
|
!log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
|
2023-03-27 08:56:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 08:57:02
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''fgiunchedi)'
|
2023-03-27 08:57:19
|
<logmsgbot>
|
!log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
|
2023-03-27 08:58:24
|
<logmsgbot>
|
!log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
|
2023-03-27 08:58:24
|
<logmsgbot>
|
!log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
2023-03-27 09:00:45
|
<wikibugs>
|
('PS1) ''Clément Goubert: mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120)'
|
2023-03-27 09:02:18
|
<wikibugs>
|
('CR) ''Jelto: [V: ''+1] "looks mostly good, one question in-line" [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
|
2023-03-27 09:02:52
|
<icinga-wm>
|
RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 09:03:06
|
<icinga-wm>
|
RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 09:03:10
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [C: ''+1] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:03:33
|
<wikibugs>
|
('CR) ''Clément Goubert: [C: ''+2] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:04:08
|
<wikibugs>
|
'SRE, ''Commons, ''Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (''Aklapper)'
|
2023-03-27 09:06:20
|
<wikibugs>
|
('PS1) ''Clément Goubert: Revert "mw-api-int: Add records" [dns] - ''https://gerrit.wikimedia.org/r/903190'
|
2023-03-27 09:08:25
|
<wikibugs>
|
('CR) ''Clément Goubert: [C: ''+2] Revert "mw-api-int: Add records" [dns] - ''https://gerrit.wikimedia.org/r/903190 (owner: ''Clément Goubert)'
|
2023-03-27 09:12:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 09:12:59
|
<wikibugs>
|
('PS1) ''Clément Goubert: mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)'
|
2023-03-27 09:13:20
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''ayounsi) I pondered multiple options for the Netbox `server_bgp` custom field, feedback from ServiceOps welcome ba...'
|
2023-03-27 09:15:23
|
<wikibugs>
|
('PS2) ''Clément Goubert: mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)'
|
2023-03-27 09:17:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 09:17:18
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [C: ''+1] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:17:20
|
<wikibugs>
|
('CR) ''Thiemo Kreuz (WMDE): [C: ''+1] mediawiki: Reduce the frequency of flaggedrevs updates (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: ''Ladsgroup)'
|
2023-03-27 09:18:04
|
<wikibugs>
|
('CR) ''Clément Goubert: [C: ''+2] mw-api-int: Add records [dns] - ''https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:24:56
|
<wikibugs>
|
('PS3) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
|
2023-03-27 09:25:43
|
<wikibugs>
|
('CR) ''Clément Goubert: [V: ''+1] "This change is ready for review." [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:27:00
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: ''Ayounsi)'
|
2023-03-27 09:33:54
|
<wikibugs>
|
('PS4) ''Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)'
|
2023-03-27 09:39:55
|
<logmsgbot>
|
!log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
|
2023-03-27 09:40:10
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [C: ''+1] "LGTM, optional nits inline." [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:40:59
|
<wikibugs>
|
('PS1) ''Jbond: Offboard nfraison [puppet] - ''https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)'
|
2023-03-27 09:41:05
|
<logmsgbot>
|
!log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
2023-03-27 09:43:37
|
<wikibugs>
|
('PS2) ''Jbond: Offboard nfraison [puppet] - ''https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)'
|
2023-03-27 09:44:49
|
<logmsgbot>
|
!log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45295
|
2023-03-27 09:45:25
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] Offboard nfraison [puppet] - ''https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135) (owner: ''Jbond)'
|
2023-03-27 09:45:41
|
<logmsgbot>
|
!log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45295
|
2023-03-27 09:46:49
|
<wikibugs>
|
('CR) ''Clément Goubert: [V: ''+1] service_catalog: Add mw-api-int k8s service (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:47:07
|
<wikibugs>
|
('PS2) ''Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)'
|
2023-03-27 09:47:09
|
<wikibugs>
|
('CR) ''Effie Mouzeli: [C: ''+1] P:kubernetes::node: Use performance governor [puppet] - ''https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) (owner: ''Clément Goubert)'
|
2023-03-27 09:47:13
|
<logmsgbot>
|
!log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
|
2023-03-27 09:47:26
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
|
2023-03-27 09:50:02
|
<wikibugs>
|
('PS3) ''Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)'
|
2023-03-27 09:50:53
|
<wikibugs>
|
('CR) ''Clément Goubert: service_catalog: Add mw-api-int k8s service (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:51:57
|
<wikibugs>
|
('CR) ''Clément Goubert: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40336/console" [puppet] - ''https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: ''Clément Goubert)'
|
2023-03-27 09:54:32
|
<wikibugs>
|
('PS7) ''Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - ''https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
|
2023-03-27 09:54:34
|
<wikibugs>
|
('PS3) ''Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure (no resources) alert [alerts] - ''https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
|
2023-03-27 09:57:46
|
<wikibugs>
|
('PS2) ''JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)'
|
2023-03-27 09:58:10
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 09:59:29
|
<wikibugs>
|
('CR) ''LSobanski: [C: ''-1] "The change has not been confirmed yet so let's not jump the gun on this." [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 09:59:32
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [V: ''+1 C: ''+2] mediawiki::errorpage: rationalize usage [puppet] - ''https://gerrit.wikimedia.org/r/902446 (owner: ''Giuseppe Lavagetto)'
|
2023-03-27 10:00:04
|
<jouncebot>
|
Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000)
|
2023-03-27 10:02:10
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''hnowlan)'
|
2023-03-27 10:02:30
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''ArielGlenn)'
|
2023-03-27 10:02:49
|
<wikibugs>
|
('CR) ''Jelto: monitoring/alerting: globally replace serviceops-collab with sre-collab (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:03:13
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''hnowlan)'
|
2023-03-27 10:03:49
|
<wikibugs>
|
('PS3) ''JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)'
|
2023-03-27 10:03:51
|
<Emperor>
|
!log depool ms-fe2009
|
2023-03-27 10:03:54
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 10:04:17
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] "thanks" [alerts] - ''https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
|
2023-03-27 10:05:30
|
<wikibugs>
|
('Merged) ''jenkins-bot: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - ''https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
|
2023-03-27 10:05:33
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] team-sre/puppet-agent: Add widespread puppet failure (no resources) alert (''1 comment) [alerts] - ''https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: ''Jbond)'
|
2023-03-27 10:06:12
|
<wikibugs>
|
'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (''Clement_Goubert)'
|
2023-03-27 10:06:24
|
<wikibugs>
|
'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 4 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''Clement_Goubert) ''Open→''In progress p:''Triage→''Medium a:''Clement_Goubert'
|
2023-03-27 10:06:44
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] team-sre/resource: Add disk space [alerts] - ''https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 10:06:54
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] team-sre/resource: Add disk space [alerts] - ''https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 10:07:10
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: "Ben, does this look good to you? thanks!" [alerts] - ''https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: ''AOkoth)'
|
2023-03-27 10:08:32
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] "Untested but LGTM, thank you Daniel" [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:08:44
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] releases: remove Icinga monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
|
2023-03-27 10:09:15
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:09:43
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:10:09
|
<elukey>
|
!log dist-upgrade kafka-main1003 manually to bullseye - T332013
|
2023-03-27 10:10:14
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 10:10:15
|
<stashbot>
|
T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013
|
2023-03-27 10:13:21
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: "LGTM modulo alert name" [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 10:15:18
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 10:15:34
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:17:04
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+1] releases: remove Icinga monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
|
2023-03-27 10:17:52
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:20:11
|
<wikibugs>
|
('PS1) ''Jbond: cinga: drop nfraison from ACL's [puppet] - ''https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135)'
|
2023-03-27 10:20:17
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [V: ''+1 C: ''+2] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - ''https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: ''Giuseppe Lavagetto)'
|
2023-03-27 10:21:16
|
<wikibugs>
|
('CR) ''JMeybohm: k8s: Force docker storage-driver to overlay2 (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 10:21:26
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] cinga: drop nfraison from ACL's [puppet] - ''https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135) (owner: ''Jbond)'
|
2023-03-27 10:21:30
|
<icinga-wm>
|
PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 10:22:16
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+1] "PCC (expected to fail on alert) https://puppet-compiler.wmflabs.org/output/902318/40337/" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 10:22:17
|
<jinxer-wm>
|
(KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
|
2023-03-27 10:22:30
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+1] k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 10:24:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 10:24:44
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
|
2023-03-27 10:24:49
|
<jinxer-wm>
|
(RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
|
2023-03-27 10:24:57
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+1] peopleweb: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 10:25:31
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
|
2023-03-27 10:27:08
|
<_joe_>
|
jouncebot: next
|
2023-03-27 10:27:09
|
<jouncebot>
|
In 2 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300)
|
2023-03-27 10:27:15
|
<_joe_>
|
jouncebot: now
|
2023-03-27 10:27:15
|
<jouncebot>
|
For the next 0 hour(s) and 32 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000)
|
2023-03-27 10:27:17
|
<jinxer-wm>
|
(KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
|
2023-03-27 10:27:28
|
<_joe_>
|
elukey: this sounds promising ^^
|
2023-03-27 10:27:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 10:28:00
|
<logmsgbot>
|
!log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
|
2023-03-27 10:28:02
|
<elukey>
|
yep all recovered :)
|
2023-03-27 10:28:39
|
<logmsgbot>
|
!log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
|
2023-03-27 10:28:49
|
<jinxer-wm>
|
(RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
|
2023-03-27 10:29:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 10:29:40
|
<jinxer-wm>
|
(NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
|
2023-03-27 10:30:55
|
<logmsgbot>
|
!log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
|
2023-03-27 10:31:20
|
<logmsgbot>
|
!log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
|
2023-03-27 10:32:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 10:33:49
|
<jinxer-wm>
|
(RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
|
2023-03-27 10:34:39
|
<jinxer-wm>
|
(NodeTextfileStale) resolved: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
|
2023-03-27 10:34:49
|
<jinxer-wm>
|
(RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
|
2023-03-27 10:35:50
|
<wikibugs>
|
('PS4) ''EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - ''https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245)'
|
2023-03-27 10:36:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 10:39:22
|
<elukey>
|
this is due to the roll restart --^
|
2023-03-27 10:39:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 10:41:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 10:41:17
|
<wikibugs>
|
'SRE-tools, ''Infrastructure-Foundations, ''Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (''SLyngshede-WMF) We're missing a "dry_run" for services and puppet, but Puppet doesn't need is as the decorator also checks for _remote_hosts.'
|
2023-03-27 10:41:28
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [C: ''+2] mesh.configuration: add support for custom error pages [deployment-charts] - ''https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: ''Giuseppe Lavagetto)'
|
2023-03-27 10:42:45
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
|
2023-03-27 10:43:18
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond) @Dzahn can you take care of password store'
|
2023-03-27 10:44:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 10:45:17
|
<wikibugs>
|
('PS4) ''Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - ''https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)'
|
2023-03-27 10:47:02
|
<wikibugs>
|
('Merged) ''jenkins-bot: mesh.configuration: add support for custom error pages [deployment-charts] - ''https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: ''Giuseppe Lavagetto)'
|
2023-03-27 10:48:05
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''BTullis) I will take care of the HBase/Haddoop permissions and any leftover files.'
|
2023-03-27 10:48:19
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+2] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - ''https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
|
2023-03-27 10:52:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 10:54:14
|
<wikibugs>
|
('CR) ''Superpes15: [huwiki] Add Draft and Draft_talk namespaces (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
|
2023-03-27 10:55:29
|
<wikibugs>
|
('PS6) ''Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083)'
|
2023-03-27 10:55:37
|
<wikibugs>
|
('PS7) ''Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083)'
|
2023-03-27 10:56:49
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 10:57:03
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 10:59:26
|
<wikibugs>
|
'SRE-tools, ''Infrastructure-Foundations, ''Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (''SLyngshede-WMF) PuppetMaster Class needs dry_run, this can be done by letting the class inherit from RemoteHostsAdapter. Service class should have a...'
|
2023-03-27 11:01:10
|
<wikibugs>
|
('CR) ''Tacsipacsi: [C: ''+1] "LGTM, thanks!" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
|
2023-03-27 11:02:17
|
<icinga-wm>
|
RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 11:02:35
|
<icinga-wm>
|
RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 11:03:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 11:04:00
|
<wikibugs>
|
('Abandoned) ''Samtar: InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) (owner: ''Samtar)'
|
2023-03-27 11:06:59
|
<wikibugs>
|
('PS2) ''Jbond: team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764)'
|
2023-03-27 11:07:07
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:07:09
|
<wikibugs>
|
('CR) ''Jbond: team-sre/systemd: add Check systemd state rule (''1 comment) [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:07:11
|
<wikibugs>
|
('PS5) ''Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - ''https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)'
|
2023-03-27 11:07:13
|
<wikibugs>
|
('PS3) ''Jbond: team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764)'
|
2023-03-27 11:07:28
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:07:47
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] team-sre/hardware: Add alert for sel events [alerts] - ''https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: ''Jbond)'
|
2023-03-27 11:08:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 11:09:54
|
<wikibugs>
|
('PS2) ''Jbond: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764)'
|
2023-03-27 11:10:20
|
<wikibugs>
|
('Merged) ''jenkins-bot: team-sre/hardware: Add alert for sel events [alerts] - ''https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: ''Jbond)'
|
2023-03-27 11:10:35
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''BTullis) I have deleted most of the leftover files and moved useful to my own home directory, but I don't have permission to update the description of this ticket...'
|
2023-03-27 11:11:01
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''cmooney) Personally I think it's a big conceptual change to introduce a second separate automation-pipeline for th...'
|
2023-03-27 11:11:27
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
|
2023-03-27 11:13:23
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''cmooney) On the Netbox side I'm happy with the current status, or having it as a dropdown. I think it's good to k...'
|
2023-03-27 11:13:44
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: ''Kamila Součková)'
|
2023-03-27 11:15:37
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+1] "LGTM cheers" [software/spicerack] - ''https://gerrit.wikimedia.org/r/902460 (owner: ''Volans)'
|
2023-03-27 11:17:32
|
<wikibugs>
|
('PS1) ''Slyngshede: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537)'
|
2023-03-27 11:19:07
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 11:19:25
|
<icinga-wm>
|
PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 11:20:43
|
<wikibugs>
|
('CR) ''Jbond: [V: ''+1 C: ''+2] Remove l10nupdate support [puppet] - ''https://gerrit.wikimedia.org/r/896318 (owner: ''Majavah)'
|
2023-03-27 11:20:58
|
<jbond>
|
taavi: fyi merging ^^
|
2023-03-27 11:23:37
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+1] "lgtm" [cookbooks] - ''https://gerrit.wikimedia.org/r/902449 (owner: ''Volans)'
|
2023-03-27 11:24:40
|
<wikibugs>
|
'SRE, ''DBA, ''Data-Engineering, ''Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (''Jelto)'
|
2023-03-27 11:24:45
|
<icinga-wm>
|
RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 11:25:03
|
<icinga-wm>
|
RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
|
2023-03-27 11:25:37
|
<icinga-wm>
|
PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.044e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
|
2023-03-27 11:27:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 11:34:13
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Jelto)'
|
2023-03-27 11:36:43
|
<wikibugs>
|
('CR) ''Volans: [C: ''+1] "LGTM, thanks for the addition" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 11:38:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 11:38:55
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:39:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 11:39:21
|
<volans>
|
jbond: did you run the logout cookbook? it seems to affect some puppet runs ^^^
|
2023-03-27 11:43:29
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: "LGTM, modulo Ben's vote" [puppet] - ''https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: ''Cathal Mooney)'
|
2023-03-27 11:44:17
|
<jinxer-wm>
|
(KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
|
2023-03-27 11:44:24
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:44:32
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:45:48
|
<godog>
|
fixing ^
|
2023-03-27 11:46:13
|
<wikibugs>
|
('PS3) ''Filippo Giunchedi: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:46:50
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:47:55
|
<wikibugs>
|
('Merged) ''jenkins-bot: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - ''https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 11:48:15
|
<wikibugs>
|
'SRE, ''serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (''Clement_Goubert) ''Open→''Resolved'
|
2023-03-27 11:48:20
|
<wikibugs>
|
'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (''Clement_Goubert)'
|
2023-03-27 11:55:50
|
<logmsgbot>
|
!log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
|
2023-03-27 11:57:05
|
<wikibugs>
|
('PS3) ''Clément Goubert: Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - ''https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: ''Ahmon Dancy)'
|
2023-03-27 12:00:15
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''RobH) Not sure why procurement was added (so it showed up in my notifications) as this user isn't in the acl*procurement review, they are in the acl*sre-team so I...'
|
2023-03-27 12:00:22
|
<wikibugs>
|
('CR) ''Clément Goubert: [C: ''+2] Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - ''https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: ''Ahmon Dancy)'
|
2023-03-27 12:00:24
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations, ''procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''RobH) @jbond, this task isn't editable by most users (so i cannot remove the invalid project), please remove the procurement project.'
|
2023-03-27 12:01:06
|
<wikibugs>
|
('PS1) ''Slyngshede: Service: Ensure that dry_run is parsed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)'
|
2023-03-27 12:06:33
|
<wikibugs>
|
'SRE-OnFire, ''SRE-Sprint-Week-Sustainability-March2023, ''Gerrit, ''serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (''hashar) ''Open→''Resolved a:''Clement_Goubert I have finally filled the follow up task: {T333143} Marking this on...'
|
2023-03-27 12:07:39
|
<wikibugs>
|
('CR) ''Jbond: [C: ''-1] "a few nits and i think an bug" [puppet] - ''https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: ''JHathaway)'
|
2023-03-27 12:08:53
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
|
2023-03-27 12:09:04
|
<wikibugs>
|
('CR) ''Slyngshede: [C: ''+2] Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 12:09:11
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238'
|
2023-03-27 12:10:02
|
<wikibugs>
|
('PS2) ''Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939)'
|
2023-03-27 12:12:40
|
<wikibugs>
|
('Merged) ''jenkins-bot: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - ''https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 12:13:01
|
<wikibugs>
|
('PS1) ''EoghanGaffney: Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245)'
|
2023-03-27 12:13:18
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
|
2023-03-27 12:13:24
|
<wikibugs>
|
('PS2) ''EoghanGaffney: Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245)'
|
2023-03-27 12:13:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 12:15:54
|
<wikibugs>
|
('CR) ''JMeybohm: k8s: Remove 1.16 related code (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 12:15:58
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 12:17:09
|
<wikibugs>
|
('Merged) ''jenkins-bot: team-sre/systemd: add Check systemd state rule [alerts] - ''https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: ''Jbond)'
|
2023-03-27 12:17:16
|
<wikibugs>
|
('PS1) ''Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192'
|
2023-03-27 12:17:25
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
|
2023-03-27 12:17:28
|
<wikibugs>
|
('CR) ''Jbond: [V: ''+2 C: ''+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
|
2023-03-27 12:17:41
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
|
2023-03-27 12:19:10
|
<wikibugs>
|
('PS2) ''Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192'
|
2023-03-27 12:19:52
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40338/console" [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: ''Filippo Giunchedi)'
|
2023-03-27 12:19:54
|
<wikibugs>
|
('CR) ''Jbond: [C: ''+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
|
2023-03-27 12:20:48
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+1 C: ''+1] hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: ''Filippo Giunchedi)'
|
2023-03-27 12:21:06
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+2] hieradata: move alerting_host to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: ''Filippo Giunchedi)'
|
2023-03-27 12:21:44
|
<wikibugs>
|
('Merged) ''jenkins-bot: Revert "team-sre/hardware: Add alert for sel events" [alerts] - ''https://gerrit.wikimedia.org/r/903192 (owner: ''Jbond)'
|
2023-03-27 12:22:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 12:23:06
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+1 C: ''+2] k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - ''https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 12:23:42
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+1 C: ''+2] k8s: Force to be explicit about k8s and calico versions [puppet] - ''https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 12:27:03
|
<wikibugs>
|
('CR) ''Jbond: "lgtm but im not sure we need this in the service class, the alertmanager instance is already set correctly which from what i see is the on" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 12:32:36
|
<wikibugs>
|
('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40339/console" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 12:36:03
|
<jinxer-wm>
|
(ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
2023-03-27 12:40:40
|
<wikibugs>
|
('PS4) ''JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)'
|
2023-03-27 12:41:03
|
<jinxer-wm>
|
(ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
2023-03-27 12:42:34
|
<godog>
|
!log flip alert* to overlay2 - T329939
|
2023-03-27 12:42:39
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 12:42:40
|
<stashbot>
|
T329939: alert hosts short of root disk space / docker devicemapper vs overlayfs - https://phabricator.wikimedia.org/T329939
|
2023-03-27 12:46:36
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 12:47:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (5) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 12:49:01
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 12:49:09
|
<wikibugs>
|
('CR) ''Hashar: releases-jenkins: replace Icinga with Prometheus monitoring (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 12:50:17
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+2] k8s: Force docker storage-driver to overlay2 [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 12:50:44
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+2] k8s: Force docker storage-driver to overlay2 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 12:51:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (15) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 12:51:32
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 12:52:56
|
<wikibugs>
|
('CR) ''Hashar: [C: ''+1] "Awesome! Feel free to deploy at any time. If Apache2 needs to be restarted that can be done at anytime (the impact is minimal, it is simpl" [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
|
2023-03-27 12:57:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 12:58:50
|
<wikibugs>
|
('PS1) ''Btullis: Upgrade the research airflow instance [puppet] - ''https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193)'
|
2023-03-27 12:59:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 13:00:05
|
<jouncebot>
|
RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300)
|
2023-03-27 13:00:05
|
<jouncebot>
|
Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
|
2023-03-27 13:00:14
|
<taavi>
|
o/ I can deploy
|
2023-03-27 13:00:20
|
<Superpes>
|
Hi taavi :)
|
2023-03-27 13:00:37
|
<wikibugs>
|
('CR) ''Btullis: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40340/console" [puppet] - ''https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
|
2023-03-27 13:02:17
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
|
2023-03-27 13:03:08
|
<wikibugs>
|
('Merged) ''jenkins-bot: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: ''Superpes15)'
|
2023-03-27 13:03:32
|
<logmsgbot>
|
!log taavi@deploy2002 Started scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]]
|
2023-03-27 13:03:38
|
<stashbot>
|
T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
|
2023-03-27 13:03:55
|
<wikibugs>
|
('CR) ''Hashar: "To clarify: +1 overall, the remarks I have made in the diff comment can be implemented or ruled out later ;)" [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 13:04:14
|
<wikibugs>
|
('PS2) ''Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)'
|
2023-03-27 13:04:58
|
<logmsgbot>
|
!log taavi@deploy2002 superpes and taavi: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
|
2023-03-27 13:05:02
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''WMF-Legal, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''oleksandr_tsyba_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public...'
|
2023-03-27 13:05:02
|
<taavi>
|
Superpes: please test
|
2023-03-27 13:05:06
|
<Superpes>
|
Looking
|
2023-03-27 13:05:19
|
<wikibugs>
|
('CR) ''Slyngshede: Service: Ensure that dry_run is passed to dataclass. (''3 comments) [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 13:05:32
|
<wikibugs>
|
('PS3) ''Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)'
|
2023-03-27 13:06:07
|
<Superpes>
|
Looks fine thanks :) taavi
|
2023-03-27 13:07:26
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''WMDE-leszek)'
|
2023-03-27 13:07:56
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''WMDE-leszek) Cleaned up the tags a bit, apologies @oleksandr_tsyba_WMDE. we have used a wrong template, again'
|
2023-03-27 13:08:07
|
<wikibugs>
|
('CR) ''Jbond: "thanks" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 13:08:40
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''WMDE-leszek) On that note, I endorse this request on WMDE's end.'
|
2023-03-27 13:11:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (19) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 13:12:17
|
<logmsgbot>
|
!log taavi@deploy2002 Finished scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] (duration: 08m 45s)
|
2023-03-27 13:12:23
|
<stashbot>
|
T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
|
2023-03-27 13:12:29
|
<taavi>
|
done!
|
2023-03-27 13:13:44
|
<Superpes>
|
Thanks taavi (maybe you have to run NamespaceDupes.php) :)
|
2023-03-27 13:13:51
|
<taavi>
|
ohhhh right
|
2023-03-27 13:13:52
|
<taavi>
|
a sec
|
2023-03-27 13:14:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 13:16:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (64) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 13:17:29
|
<wikibugs>
|
('CR) ''Volans: [C: ''+1] "LGTM, thanks!" [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 13:18:01
|
<taavi>
|
hm I suspect the script might be broken, it's just printing the same few pagelinks rows over and over again
|
2023-03-27 13:18:05
|
<taavi>
|
Amir1: ^ any clues why?
|
2023-03-27 13:18:18
|
<wikibugs>
|
('PS1) ''Elukey: Move kafka-jumbo1001's kafka broker to PKI certs [puppet] - ''https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064)'
|
2023-03-27 13:19:18
|
<wikibugs>
|
('PS1) ''Ssingh: hiera: temporarily removed dns1003 from authdns_servers [puppet] - ''https://gerrit.wikimedia.org/r/903246 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 13:19:44
|
<wikibugs>
|
('CR) ''Elukey: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40341/console" [puppet] - ''https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: ''Elukey)'
|
2023-03-27 13:20:23
|
<taavi>
|
looking at wmf.1 changelog I don't see anything helpful
|
2023-03-27 13:24:17
|
<Amir1>
|
sorry I was having lunch, let me check
|
2023-03-27 13:24:22
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''cmooney) Hoping to kick-start some more discussion around this and try to close this out. I still firmly believe tha...'
|
2023-03-27 13:24:50
|
<taavi>
|
basically namespaceDupes seems to not update the WHERE condition when printing the list of pagelinks rows it would need to update
|
2023-03-27 13:24:54
|
<Amir1>
|
yeah, NameSpacesDupes is broken
|
2023-03-27 13:25:03
|
<zabe>
|
I got the same issue a week or so ago (sorry, forgot to create a task), but it didn't show up when running with --fix
|
2023-03-27 13:25:06
|
<taavi>
|
I don't know if it's pagelinks specific or a wider issue
|
2023-03-27 13:27:11
|
<TheresNoTime>
|
oh, I got that in https://phabricator.wikimedia.org/P45894 too, about a week ago
|
2023-03-27 13:27:16
|
<Amir1>
|
yeah it's broken, file a task and I'll take a look
|
2023-03-27 13:27:19
|
<wikibugs>
|
('PS1) ''Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - ''https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 13:27:47
|
<Amir1>
|
it used to cause data corruption, I'm fine with the current state to be honest
|
2023-03-27 13:27:56
|
<elukey>
|
hey folks lemme know when the backport window close (no rush), after that I'll start some maintenance to redis misc clusters
|
2023-03-27 13:28:02
|
<elukey>
|
*closes
|
2023-03-27 13:28:33
|
<taavi>
|
elukey: we're debugging a maintenance script, might take a while
|
2023-03-27 13:29:14
|
<taavi>
|
Amir1: yeah I think I'd prefer leaving some broken rows for now over blindly running with --fix
|
2023-03-27 13:29:25
|
<Amir1>
|
I don't think we can fix the issue right now
|
2023-03-27 13:29:52
|
<Amir1>
|
let it be, links tables always have some sorta drifts
|
2023-03-27 13:30:14
|
<Amir1>
|
my hope would be to do the important fixes and the links one as an argument
|
2023-03-27 13:30:16
|
<Amir1>
|
but meh
|
2023-03-27 13:30:38
|
<taavi>
|
hm
|
2023-03-27 13:31:00
|
<taavi>
|
although this is breaking access to those actual pages
|
2023-03-27 13:32:14
|
<taavi>
|
so I don't want to leave that broken either
|
2023-03-27 13:34:51
|
<wikibugs>
|
('CR) ''Btullis: [V: ''+1 C: ''+2] Upgrade the research airflow instance [puppet] - ''https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
|
2023-03-27 13:35:23
|
<logmsgbot>
|
!log fab@deploy2002 Started deploy [airflow-dags/research@d2c115d]: (no justification provided)
|
2023-03-27 13:35:44
|
<logmsgbot>
|
!log fab@deploy2002 Finished deploy [airflow-dags/research@d2c115d]: (no justification provided) (duration: 00m 21s)
|
2023-03-27 13:36:31
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] rbd2backy2: clean up a debugging line [puppet] - ''https://gerrit.wikimedia.org/r/900648 (owner: ''Andrew Bogott)'
|
2023-03-27 13:37:12
|
<Amir1>
|
taavi: can you just comment out the links updates in maint script and re-run it?
|
2023-03-27 13:40:37
|
<taavi>
|
Amir1: I think I found the issue
|
2023-03-27 13:41:26
|
<taavi>
|
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829293 should have removed the addQuotes() calls from namespaceDupes.php as buildComparison does it for you
|
2023-03-27 13:41:51
|
<taavi>
|
patch incoming
|
2023-03-27 13:45:07
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''ayounsi) Overall I agree it's an improvement to have the parent interfaces defined in Netbox. I lost a bit context o...'
|
2023-03-27 13:46:14
|
<taavi>
|
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/903253
|
2023-03-27 13:46:44
|
<wikibugs>
|
('PS1) ''Btullis: Remove stray referece to ariflow db from research instance [puppet] - ''https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193)'
|
2023-03-27 13:47:57
|
<Amir1>
|
thanks for catching it
|
2023-03-27 13:48:24
|
<wikibugs>
|
('CR) ''Btullis: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40342/console" [puppet] - ''https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
|
2023-03-27 13:48:50
|
<wikibugs>
|
('PS1) ''Majavah: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166)'
|
2023-03-27 13:49:15
|
<wikibugs>
|
('CR) ''Btullis: [V: ''+1 C: ''+2] Remove stray referece to ariflow db from research instance [puppet] - ''https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: ''Btullis)'
|
2023-03-27 13:49:21
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by taavi@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: ''Majavah)'
|
2023-03-27 13:50:37
|
<icinga-wm>
|
RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 2 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
|
2023-03-27 13:53:11
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] dumps: properly absent enterprise timers [puppet] - ''https://gerrit.wikimedia.org/r/902833 (owner: ''Majavah)'
|
2023-03-27 13:55:58
|
<wikibugs>
|
'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''JJMC89)'
|
2023-03-27 13:58:22
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''cmooney) >>! In T296832#8729090, @ayounsi wrote: > I lost a bit context on how it will be done on a day to bay basis,...'
|
2023-03-27 13:58:39
|
<wikibugs>
|
('PS1) ''Majavah: hieradata: swap eqiad1 dns server order [puppet] - ''https://gerrit.wikimedia.org/r/903257'
|
2023-03-27 13:58:41
|
<wikibugs>
|
('PS1) ''Majavah: hieradata: remove unused keys from labsdnsconfig [puppet] - ''https://gerrit.wikimedia.org/r/903258'
|
2023-03-27 14:00:21
|
<wikibugs>
|
('CR) ''Jelto: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
|
2023-03-27 14:00:58
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] hieradata: swap eqiad1 dns server order [puppet] - ''https://gerrit.wikimedia.org/r/903257 (owner: ''Majavah)'
|
2023-03-27 14:01:22
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+2] Set log format to ecs on doc hosts [puppet] - ''https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
|
2023-03-27 14:01:25
|
<wikibugs>
|
('PS1) ''Majavah: Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - ''https://gerrit.wikimedia.org/r/903259'
|
2023-03-27 14:02:07
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [C: ''+2] Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - ''https://gerrit.wikimedia.org/r/902078 (owner: ''Giuseppe Lavagetto)'
|
2023-03-27 14:04:57
|
<wikibugs>
|
('CR) ''Slyngshede: [C: ''+2] Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 14:05:59
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - ''https://gerrit.wikimedia.org/r/903259 (owner: ''Majavah)'
|
2023-03-27 14:06:20
|
<wikibugs>
|
('Merged) ''jenkins-bot: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: ''Majavah)'
|
2023-03-27 14:06:36
|
<logmsgbot>
|
!log taavi@deploy2002 Started scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]]
|
2023-03-27 14:06:43
|
<stashbot>
|
T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166
|
2023-03-27 14:07:17
|
<wikibugs>
|
('Merged) ''jenkins-bot: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - ''https://gerrit.wikimedia.org/r/902078 (owner: ''Giuseppe Lavagetto)'
|
2023-03-27 14:08:00
|
<logmsgbot>
|
!log taavi@deploy2002 taavi: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
|
2023-03-27 14:08:03
|
<wikibugs>
|
('PS1) ''Hashar: gerrit: set gitiles clone url to http (Gerrit 3.6.2) [puppet] - ''https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049)'
|
2023-03-27 14:09:18
|
<wikibugs>
|
('Merged) ''jenkins-bot: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - ''https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: ''Slyngshede)'
|
2023-03-27 14:10:57
|
<elukey>
|
jouncebot: next
|
2023-03-27 14:10:57
|
<jouncebot>
|
In 1 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530)
|
2023-03-27 14:11:19
|
<taavi>
|
elukey: give me just a few more minutes please
|
2023-03-27 14:12:13
|
<elukey>
|
sure, I was just checking next windows :)
|
2023-03-27 14:14:39
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:14:39
|
<logmsgbot>
|
!log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
|
2023-03-27 14:14:51
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:15:04
|
<logmsgbot>
|
!log taavi@deploy2002 Finished scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] (duration: 08m 27s)
|
2023-03-27 14:15:09
|
<stashbot>
|
T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166
|
2023-03-27 14:15:09
|
<logmsgbot>
|
!log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
|
2023-03-27 14:15:32
|
<wikibugs>
|
('CR) ''JHathaway: [C: ''+1] "Thanks for removing this cruft, looks good to me!" [puppet] - ''https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: ''JMeybohm)'
|
2023-03-27 14:16:03
|
<taavi>
|
!log taavi@mwmaint2002 ~ $ mwscript namespaceDupes.php --wiki=huwiki --fix # T333083
|
2023-03-27 14:16:07
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 14:16:08
|
<stashbot>
|
T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
|
2023-03-27 14:16:09
|
<taavi>
|
elukey: all done!
|
2023-03-27 14:16:20
|
<elukey>
|
nice thanks!
|
2023-03-27 14:16:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (71) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 14:16:59
|
<Superpes>
|
Wow wonderful taavi
|
2023-03-27 14:17:09
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:17:16
|
<Superpes>
|
Thanks :)
|
2023-03-27 14:17:17
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:21:08
|
<wikibugs>
|
('PS2) ''Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - ''https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165)'
|
2023-03-27 14:21:10
|
<wikibugs>
|
('PS1) ''Andrew Bogott: Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - ''https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169)'
|
2023-03-27 14:21:25
|
<wikibugs>
|
('PS8) ''Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - ''https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)'
|
2023-03-27 14:21:32
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (73) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 14:24:26
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - ''https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165) (owner: ''Andrew Bogott)'
|
2023-03-27 14:24:53
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - ''https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169) (owner: ''Andrew Bogott)'
|
2023-03-27 14:27:22
|
<wikibugs>
|
('PS1) ''EoghanGaffney: Assign insetup role to new aphlict vm [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369)'
|
2023-03-27 14:27:28
|
<wikibugs>
|
('PS1) ''Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - ''https://gerrit.wikimedia.org/r/903265'
|
2023-03-27 14:27:45
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
|
2023-03-27 14:28:05
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
|
2023-03-27 14:28:14
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
|
2023-03-27 14:28:33
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
|
2023-03-27 14:28:57
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
|
2023-03-27 14:29:13
|
<wikibugs>
|
('PS1) ''Bking: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675)'
|
2023-03-27 14:29:14
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
|
2023-03-27 14:29:23
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
|
2023-03-27 14:29:39
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
|
2023-03-27 14:29:45
|
<wikibugs>
|
('CR) ''DCausse: [C: ''+1] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: ''Bking)'
|
2023-03-27 14:30:04
|
<wikibugs>
|
('CR) ''Bking: [C: ''+2] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: ''Bking)'
|
2023-03-27 14:30:25
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
|
2023-03-27 14:33:44
|
<wikibugs>
|
('PS2) ''Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - ''https://gerrit.wikimedia.org/r/903265'
|
2023-03-27 14:34:52
|
<wikibugs>
|
('CR) ''Slyngshede: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40344/console" [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
|
2023-03-27 14:35:45
|
<wikibugs>
|
('Merged) ''jenkins-bot: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: ''Bking)'
|
2023-03-27 14:38:31
|
<wikibugs>
|
('CR) ''Hnowlan: [C: ''+1] changeprop-jobqueue: Double resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/902067 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 14:39:21
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:39:28
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:40:16
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:40:32
|
<wikibugs>
|
('CR) ''Slyngshede: [V: ''+1] "During sprint-week I noticed that we're not collecting Squid access logs from the urldownload servers." [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
|
2023-03-27 14:40:34
|
<logmsgbot>
|
!log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
|
2023-03-27 14:40:56
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
|
2023-03-27 14:41:56
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (''MNadrofsky) Approved.'
|
2023-03-27 14:43:28
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:43:30
|
<wikibugs>
|
('PS1) ''Andrew Bogott: Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - ''https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169)'
|
2023-03-27 14:43:59
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:44:29
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:44:55
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:45:06
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:45:15
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:46:27
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:46:35
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
|
2023-03-27 14:46:37
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - ''https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169) (owner: ''Andrew Bogott)'
|
2023-03-27 14:47:41
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
|
2023-03-27 14:47:55
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
|
2023-03-27 14:48:05
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
|
2023-03-27 14:48:15
|
<logmsgbot>
|
!log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
|
2023-03-27 14:52:34
|
<logmsgbot>
|
!log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host aphlict1002.eqiad.wmnet
|
2023-03-27 14:52:35
|
<logmsgbot>
|
!log eoghan@cumin1001 START - Cookbook sre.dns.netbox
|
2023-03-27 14:53:41
|
<wikibugs>
|
'SRE-Access-Requests, ''Lift-Wing, ''Machine-Learning-Team: Machine Learning team - k8s resources ccess - https://phabricator.wikimedia.org/T333174 (''isarantopoulos)'
|
2023-03-27 14:55:03
|
<logmsgbot>
|
!log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001"
|
2023-03-27 14:55:07
|
<wikibugs>
|
'SRE-Access-Requests, ''Lift-Wing, ''Machine-Learning-Team: Machine Learning team - k8s resources ccess - https://phabricator.wikimedia.org/T333174 (''isarantopoulos)'
|
2023-03-27 14:56:07
|
<logmsgbot>
|
!log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001"
|
2023-03-27 14:56:07
|
<logmsgbot>
|
!log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
2023-03-27 14:56:07
|
<logmsgbot>
|
!log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache aphlict1002.eqiad.wmnet on all recursors
|
2023-03-27 14:56:10
|
<logmsgbot>
|
!log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aphlict1002.eqiad.wmnet on all recursors
|
2023-03-27 14:57:30
|
<wikibugs>
|
'SRE, ''Data-Persistence, ''serviceops, ''Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (''Trizek-WMF)'
|
2023-03-27 14:57:42
|
<wikibugs>
|
'SRE, ''serviceops, ''CommRel-Specialists-Support (Jan-Mar-2023), ''Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (''Trizek-WMF) ''In progress→''Resolved A post-action document has been created. There is nothing special to highl...'
|
2023-03-27 14:57:50
|
<wikibugs>
|
('CR) ''Jbond: "see inline" [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
|
2023-03-27 14:58:18
|
<wikibugs>
|
('PS4) ''Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - ''https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)'
|
2023-03-27 15:01:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:05:52
|
<logmsgbot>
|
!log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aphlict1002.eqiad.wmnet
|
2023-03-27 15:11:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:11:49
|
<wikibugs>
|
('PS10) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130'
|
2023-03-27 15:11:52
|
<wikibugs>
|
('PS20) ''Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093'
|
2023-03-27 15:14:02
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
|
2023-03-27 15:14:14
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093 (owner: ''Jbond)'
|
2023-03-27 15:15:17
|
<wikibugs>
|
('PS1) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
|
2023-03-27 15:16:19
|
<wikibugs>
|
('PS2) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
|
2023-03-27 15:16:32
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (26) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:16:32
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:17:00
|
<logmsgbot>
|
!log elukey@deploy2002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 06m 10s)
|
2023-03-27 15:17:45
|
<icinga-wm>
|
PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 15:17:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 15:19:57
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
|
2023-03-27 15:20:32
|
<wikibugs>
|
('CR) ''Vgutierrez: [V: ''+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40345/console" [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: ''Vgutierrez)'
|
2023-03-27 15:20:39
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
|
2023-03-27 15:20:44
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
|
2023-03-27 15:20:48
|
<logmsgbot>
|
!log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
|
2023-03-27 15:21:32
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (53) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:21:32
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:22:18
|
<wikibugs>
|
('PS21) ''Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093'
|
2023-03-27 15:22:22
|
<wikibugs>
|
('PS3) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
|
2023-03-27 15:22:25
|
<wikibugs>
|
('PS8) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
|
2023-03-27 15:22:35
|
<wikibugs>
|
('PS9) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
|
2023-03-27 15:22:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 15:23:27
|
<icinga-wm>
|
PROBLEM - puppet last run on rdb2008 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
|
2023-03-27 15:24:49
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135 (owner: ''Jbond)'
|
2023-03-27 15:25:12
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093 (owner: ''Jbond)'
|
2023-03-27 15:26:19
|
<wikibugs>
|
'SRE-swift-storage, ''Data-Engineering-Planning, ''Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (''Ottomata) > usefulness of cross-DC replication After asking @dcausse, I unde...'
|
2023-03-27 15:27:16
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''netbox, ''netops, ''Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (''Volans) Looks ok to me too, I'm no sure about all the details involved if we need to patch things like the dns genera...'
|
2023-03-27 15:29:13
|
<icinga-wm>
|
PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 15:29:23
|
<icinga-wm>
|
RECOVERY - puppet last run on rdb2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
|
2023-03-27 15:29:45
|
<wikibugs>
|
('CR) ''BBlack: [C: ''+1] "Seems right to me, for this testing!" [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: ''Vgutierrez)'
|
2023-03-27 15:30:05
|
<jouncebot>
|
jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530)
|
2023-03-27 15:32:44
|
<wikibugs>
|
('PS22) ''Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093'
|
2023-03-27 15:32:55
|
<wikibugs>
|
('PS10) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
|
2023-03-27 15:33:00
|
<wikibugs>
|
('PS11) ''Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135'
|
2023-03-27 15:34:49
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - ''https://gerrit.wikimedia.org/r/849093 (owner: ''Jbond)'
|
2023-03-27 15:35:46
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - ''https://gerrit.wikimedia.org/r/849135 (owner: ''Jbond)'
|
2023-03-27 15:36:36
|
<wikibugs>
|
'SRE, ''MediaWiki-extensions-OAuth, ''Performance-Team, ''Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (''Samwalton9)'
|
2023-03-27 15:42:03
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (''Ottomata) Hi, just back from vacation too. @FNavas-foundation can you update the task description with exactly what you need access too? Your comment mentions a 'spe...'
|
2023-03-27 15:44:15
|
<icinga-wm>
|
RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 15:46:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (11) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 15:46:51
|
<wikibugs>
|
('PS1) ''Ayounsi: Varnish: prefix 403 and 429 with a unique ID [puppet] - ''https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973)'
|
2023-03-27 15:46:55
|
<wikibugs>
|
('PS1) ''Filippo Giunchedi: alertmanager: default to IRC for foundations [puppet] - ''https://gerrit.wikimedia.org/r/903285'
|
2023-03-27 15:47:02
|
<godog>
|
jbond: ^
|
2023-03-27 15:50:26
|
<wikibugs>
|
('CR) ''Jelto: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: ''EoghanGaffney)'
|
2023-03-27 15:50:53
|
<wikibugs>
|
('Abandoned) ''Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - ''https://gerrit.wikimedia.org/r/903228 (owner: ''L10n-bot)'
|
2023-03-27 15:51:53
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (''Ottomata) I think they need [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Levels | sql_lab role permissions in Superset ]]. Pinging @Milimetr...'
|
2023-03-27 15:53:35
|
<wikibugs>
|
('CR) ''Hnowlan: [C: ''+2] admin: add user kamila [puppet] - ''https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: ''Kamila Součková)'
|
2023-03-27 15:53:38
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (''hnowlan)'
|
2023-03-27 15:54:04
|
<wikibugs>
|
'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (''Clement_Goubert)'
|
2023-03-27 15:54:52
|
<logmsgbot>
|
!log ebysans@deploy2002 Started deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided)
|
2023-03-27 15:55:03
|
<logmsgbot>
|
!log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided) (duration: 00m 11s)
|
2023-03-27 15:55:44
|
<wikibugs>
|
('PS11) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130'
|
2023-03-27 15:56:40
|
<wikibugs>
|
('PS4) ''Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - ''https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)'
|
2023-03-27 15:57:13
|
<wikibugs>
|
('CR) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (''6 comments) [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
|
2023-03-27 15:57:56
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
|
2023-03-27 15:58:10
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C: ''+2] alertmanager: default to IRC for foundations [puppet] - ''https://gerrit.wikimedia.org/r/903285 (owner: ''Filippo Giunchedi)'
|
2023-03-27 15:58:14
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] planet: Add Wikimedia category of Jan Ainali's blog [puppet] - ''https://gerrit.wikimedia.org/r/902829 (owner: ''Legoktm)'
|
2023-03-27 15:58:45
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] planet: Add Nemo_bis's new blog [puppet] - ''https://gerrit.wikimedia.org/r/902828 (owner: ''Legoktm)'
|
2023-03-27 15:59:16
|
<wikibugs>
|
('PS2) ''Dzahn: planet: Add Wikimedia category of Jan Ainali's blog [puppet] - ''https://gerrit.wikimedia.org/r/902829 (owner: ''Legoktm)'
|
2023-03-27 16:03:53
|
<wikibugs>
|
('PS12) ''Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130'
|
2023-03-27 16:04:12
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - ''https://gerrit.wikimedia.org/r/902832 (owner: ''Krinkle)'
|
2023-03-27 16:04:15
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: [C: ''+2] admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - ''https://gerrit.wikimedia.org/r/902066 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:04:22
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: [C: ''+2] changeprop-jobqueue: Double resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/902067 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:06:10
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
|
2023-03-27 16:06:52
|
<wikibugs>
|
('CR) ''Jbond: "for the mypy alerts we need to wait for a spicerack release" [cookbooks] - ''https://gerrit.wikimedia.org/r/849130 (owner: ''Jbond)'
|
2023-03-27 16:08:14
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+1] "looks good, I do want to rename the role to sre_collab, but that will require rebasing one way or another" [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: ''EoghanGaffney)'
|
2023-03-27 16:10:03
|
<wikibugs>
|
('Merged) ''jenkins-bot: admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - ''https://gerrit.wikimedia.org/r/902066 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:10:05
|
<wikibugs>
|
('Merged) ''jenkins-bot: changeprop-jobqueue: Double resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/902067 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:14:06
|
<wikibugs>
|
('PS1) ''Alexandros Kosiaris: admin: Grant kserve API group read access to deploy user [deployment-charts] - ''https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174)'
|
2023-03-27 16:15:05
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: "Luca, Janis, regardless of the outcome of the discussion in the linked task, let me know if this is the preferable way of doing this." [deployment-charts] - ''https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:20:52
|
<wikibugs>
|
('PS1) ''Alexandros Kosiaris: admin: Fix mw-web resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/903298'
|
2023-03-27 16:25:33
|
<wikibugs>
|
('PS1) ''Alexandros Kosiaris: admin: Make sure resource quotas are honored for staging too [deployment-charts] - ''https://gerrit.wikimedia.org/r/903299'
|
2023-03-27 16:26:14
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: [C: ''+2] admin: Fix mw-web resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/903298 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:29:40
|
<wikibugs>
|
('PS1) ''Jbond: os-reports: fix yaml data for apt_repo [puppet] - ''https://gerrit.wikimedia.org/r/903301'
|
2023-03-27 16:30:22
|
<wikibugs>
|
('CR) ''Jbond: [V: ''+2 C: ''+2] os-reports: fix yaml data for apt_repo [puppet] - ''https://gerrit.wikimedia.org/r/903301 (owner: ''Jbond)'
|
2023-03-27 16:31:05
|
<wikibugs>
|
('Merged) ''jenkins-bot: admin: Fix mw-web resource quotas [deployment-charts] - ''https://gerrit.wikimedia.org/r/903298 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:31:29
|
<wikibugs>
|
('CR) ''Alexandros Kosiaris: [C: ''+2] admin: Make sure resource quotas are honored for staging too [deployment-charts] - ''https://gerrit.wikimedia.org/r/903299 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:32:35
|
<icinga-wm>
|
RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 16:33:28
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
|
2023-03-27 16:34:14
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
|
2023-03-27 16:34:20
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
|
2023-03-27 16:34:31
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 16:34:35
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
|
2023-03-27 16:36:05
|
<wikibugs>
|
('Merged) ''jenkins-bot: admin: Make sure resource quotas are honored for staging too [deployment-charts] - ''https://gerrit.wikimedia.org/r/903299 (owner: ''Alexandros Kosiaris)'
|
2023-03-27 16:39:09
|
<icinga-wm>
|
RECOVERY - Host db1150 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
|
2023-03-27 16:39:42
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
|
2023-03-27 16:39:58
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
|
2023-03-27 16:40:03
|
<akosiaris>
|
hashar: changeprop-jobqueue resource-quotas doubled
|
2023-03-27 16:40:12
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
|
2023-03-27 16:40:59
|
<logmsgbot>
|
!log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
|
2023-03-27 16:43:45
|
<akosiaris>
|
sigh, pinged the wrong person, sorry Antoine
|
2023-03-27 16:43:50
|
<akosiaris>
|
hnowlan: changeprop-jobqueue resource-quotas doubled
|
2023-03-27 16:44:00
|
<wikibugs>
|
('PS1) ''Jbond: idm: remove auto restart for apache-htcacheclean [puppet] - ''https://gerrit.wikimedia.org/r/903302'
|
2023-03-27 16:44:28
|
<wikibugs>
|
('CR) ''Jbond: [V: ''+2 C: ''+2] idm: remove auto restart for apache-htcacheclean [puppet] - ''https://gerrit.wikimedia.org/r/903302 (owner: ''Jbond)'
|
2023-03-27 16:45:23
|
<wikibugs>
|
('PS1) ''Jbond: Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - ''https://gerrit.wikimedia.org/r/903199'
|
2023-03-27 16:45:29
|
<wikibugs>
|
('CR) ''Jbond: [V: ''+2 C: ''+2] Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - ''https://gerrit.wikimedia.org/r/903199 (owner: ''Jbond)'
|
2023-03-27 16:47:38
|
<wikibugs>
|
'ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (''Jhancock.wm) I logged into the ilo and as of now there are no errors on that link. Papaul pointed me to T330218 where he suggested moving the network port from ge-6/0/6 to ge-6/0/1. Since this issue comes back intermittently, i...'
|
2023-03-27 16:48:13
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''Dzahn) @jbond Is there a process for offboarding that describes how to do it correctly? As basically every single time I am trying to edit pwstore it is blocked by an invalid key...'
|
2023-03-27 16:49:29
|
<hnowlan>
|
akosiaris: thanks!
|
2023-03-27 16:49:57
|
<icinga-wm>
|
RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
|
2023-03-27 16:53:59
|
<icinga-wm>
|
PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
|
2023-03-27 16:54:23
|
<icinga-wm>
|
RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
|
2023-03-27 16:58:29
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''Dzahn) I removed both nfraison and eoghan from the .users file, re-signed it and then re-encrypted all files that I could encrypt, then pushed to repo. This does not change their...'
|
2023-03-27 16:59:40
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (9) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 17:00:05
|
<jouncebot>
|
Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
|
2023-03-27 17:00:05
|
<jouncebot>
|
ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
|
2023-03-27 17:05:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 17:06:26
|
<sukhe>
|
er what's this lvs failure, checking
|
2023-03-27 17:08:25
|
<sukhe>
|
hmm ran agent manually, resolved. must be transient
|
2023-03-27 17:15:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 17:19:43
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond)'
|
2023-03-27 17:19:57
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''jbond) ''In progress→''Resolved > @jbond Is there a process for offboarding that describes how to do it correctly? not really the [[ https://wikitech.wikimedia.org/wiki/SRE_Of...'
|
2023-03-27 17:20:35
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''Prod-Kubernetes, ''netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (''ayounsi) > Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer? Th...'
|
2023-03-27 17:23:59
|
<wikibugs>
|
('CR) ''Dzahn: "Should this be merged before the upgrade or should it wait until the upgrade?" [puppet] - ''https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: ''Hashar)'
|
2023-03-27 17:25:26
|
<wikibugs>
|
('CR) ''Dzahn: "We already have bullseye doc machines thanks to Andrea's work. We should just switch to those." [puppet] - ''https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: ''Hashar)'
|
2023-03-27 17:27:55
|
<wikibugs>
|
('CR) ''Dzahn: "just means there will be a lot more rebasing because we keep adding to this. in that case it's easier to abandon it" [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 17:28:04
|
<wikibugs>
|
('Abandoned) ''Dzahn: monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 17:31:52
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Move mbsantos and jgiannelos from parsoid-test-admins to parsoid-test-roots - https://phabricator.wikimedia.org/T333206 (''ssastry)'
|
2023-03-27 17:34:59
|
<wikibugs>
|
('PS1) ''Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - ''https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206)'
|
2023-03-27 17:37:39
|
<wikibugs>
|
('CR) ''Slyngshede: [V: ''+1] P:url_downloader send Squid access logs to Logstash (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903265 (owner: ''Slyngshede)'
|
2023-03-27 17:41:05
|
<wikibugs>
|
'SRE, ''ops-eqiad, ''DBA, ''Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (''Cmjohnson) ''Open→''Resolved The DIMM has been replaced, I updated the idrac and bios while it was offline.'
|
2023-03-27 17:45:13
|
<wikibugs>
|
('CR) ''Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 17:56:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 17:59:16
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 18:06:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 18:07:35
|
<wikibugs>
|
('PS3) ''Andrea Denisse: doc: Add support for passive_hosts synchronization via rsync [puppet] - ''https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477)'
|
2023-03-27 18:09:33
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] doc: Add support for passive_hosts synchronization via rsync [puppet] - ''https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
|
2023-03-27 18:11:15
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''netops, ''cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (''Papaul) We moved the first batch of servers today; all went well.'
|
2023-03-27 18:15:56
|
<wikibugs>
|
('PS1) ''Dzahn: alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587)'
|
2023-03-27 18:16:26
|
<wikibugs>
|
('PS1) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
|
2023-03-27 18:19:00
|
<wikibugs>
|
('CR) ''Andrea Denisse: [V: ''+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40349/console" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
|
2023-03-27 18:19:21
|
<wikibugs>
|
('PS2) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
|
2023-03-27 18:20:23
|
<wikibugs>
|
('CR) ''Andrea Denisse: [V: ''+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40350/console" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
|
2023-03-27 18:20:42
|
<wikibugs>
|
('PS3) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
|
2023-03-27 18:21:47
|
<wikibugs>
|
('CR) ''Andrea Denisse: [V: ''+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40351/console" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
|
2023-03-27 18:25:05
|
<wikibugs>
|
('PS4) ''Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)'
|
2023-03-27 18:28:50
|
<wikibugs>
|
('CR) ''Dzahn: "It looks ok in compiler, and I can check in devtools, but I don't want to get into follow-ups in deployment-prep and other projects." [puppet] - ''https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: ''Jaime Nuche)'
|
2023-03-27 18:29:05
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] deployment_server: ensure Docker is installed [puppet] - ''https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: ''Jaime Nuche)'
|
2023-03-27 18:31:27
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+1] "lgtm!" [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
|
2023-03-27 18:37:08
|
<wikibugs>
|
('PS14) ''Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - ''https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)'
|
2023-03-27 18:37:18
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''netops, ''cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (''Papaul) @cmooney second batch proposal below |Host|U space|Existing port|New port| |cloudcephosd2002-de...'
|
2023-03-27 18:37:51
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (''KFrancis) Hello, @oleksandr_tsyba_WMDE, I'll be helping with this request. Would you please send your WMDE email address to kfrancis@wikimedia.org?'
|
2023-03-27 18:41:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 18:41:42
|
<wikibugs>
|
('CR) ''CI reject: [V: ''-1] Refactor and centralize BGPpeer config [deployment-charts] - ''https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: ''Ayounsi)'
|
2023-03-27 18:43:47
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "confirmed noop on production deployment servers, deploy1002 and deploy2002 - fails in devtools, mostly expected" [puppet] - ''https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: ''Jaime Nuche)'
|
2023-03-27 18:45:35
|
<wikibugs>
|
'SRE, ''MediaWiki-extensions-OAuth, ''Datacenter-Switchover, ''Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (''larissagaulia)'
|
2023-03-27 18:45:48
|
<wikibugs>
|
('PS1) ''Dzahn: Revert "deployment_server: ensure Docker is installed" [puppet] - ''https://gerrit.wikimedia.org/r/903200'
|
2023-03-27 18:46:00
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives don't match any configuration option: st" [puppet] - ''https://gerrit.wikimedia.org/r/903200 (owner: ''Dzahn)'
|
2023-03-27 18:51:02
|
<wikibugs>
|
('CR) ''Dzahn: "what fails about them? or rather, what bothered you about them?" [puppet] - ''https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: ''Jcrespo)'
|
2023-03-27 18:52:43
|
<wikibugs>
|
('PS1) ''Dzahn: Revert "bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs" [puppet] - ''https://gerrit.wikimedia.org/r/903201'
|
2023-03-27 18:54:42
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "puppet works again on deploy-1004.devtools after reverting so should be fine in deployment-prep as well" [puppet] - ''https://gerrit.wikimedia.org/r/903200 (owner: ''Dzahn)'
|
2023-03-27 18:56:30
|
<wikibugs>
|
('CR) ''Dzahn: "changes to admin groups might require access request tickets, this should be done between clinic duty and serviceops team. I don't have co" [puppet] - ''https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: ''Subramanya Sastry)'
|
2023-03-27 18:57:37
|
<wikibugs>
|
('CR) ''Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
|
2023-03-27 18:57:46
|
<wikibugs>
|
('CR) ''Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - ''https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 19:06:56
|
<wikibugs>
|
('CR) ''Dzahn: zuul: fix up service enable and ensure (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
|
2023-03-27 19:07:48
|
<wikibugs>
|
('PS1) ''Jdlrobson: Expand list of wikis with language button at top. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777)'
|
2023-03-27 19:10:10
|
<wikibugs>
|
('PS1) ''Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)'
|
2023-03-27 19:11:17
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "https://puppet-compiler.wmflabs.org/output/901576/40355/" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
|
2023-03-27 19:15:00
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "confirmed this changed nothing on all 3 contint* servers. zuul is still running on contint2001, masked on contint1002 and unknown on conti" [puppet] - ''https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: ''Hashar)'
|
2023-03-27 19:18:29
|
<wikibugs>
|
('PS2) ''Jdlrobson: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093)'
|
2023-03-27 19:21:22
|
<logmsgbot>
|
!log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2
|
2023-03-27 19:21:37
|
<logmsgbot>
|
!log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2 (duration: 00m 14s)
|
2023-03-27 19:25:29
|
<wikibugs>
|
('PS1) ''Superpes15: Disable VisualEditor from talk namespace [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326'
|
2023-03-27 19:26:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 19:29:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 19:30:03
|
<wikibugs>
|
'SRE, ''Traffic, ''HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (''greg) Hi @BCornwall ! I'm just jumping in as an FR-Tech representative. I think I've got the summary here (basically, in the end, Shopify can't meet our hsts header needs which blocks overal...'
|
2023-03-27 19:40:47
|
<wikibugs>
|
('PS1) ''Dzahn: admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - ''https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868)'
|
2023-03-27 19:43:25
|
<wikibugs>
|
('PS2) ''Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)'
|
2023-03-27 19:44:08
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (''Dzahn) @larissagaulia Thank you for adding the information. Does "until July" mean "until last day of June"? I uploaded a code change above that is now in review. Access requests...'
|
2023-03-27 19:45:13
|
<wikibugs>
|
('CR) ''Ahmon Dancy: Revert "deployment_server: ensure Docker is installed" (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903200 (owner: ''Dzahn)'
|
2023-03-27 19:45:32
|
<wikibugs>
|
('PS3) ''Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)'
|
2023-03-27 19:46:20
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (''Dzahn) ''Open→''In progress p:''Triage→''Medium'
|
2023-03-27 19:46:31
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (''Dzahn) a:''Ladsgroup'
|
2023-03-27 19:49:11
|
<wikibugs>
|
('CR) ''Dzahn: "this may have caused that you can't include the docker class on new hosts anymore without a puppet error. : https://gerrit.wikimedia.org/r" [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 19:52:02
|
<wikibugs>
|
('CR) ''Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 19:56:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 19:59:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:00:04
|
<jouncebot>
|
RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2000).
|
2023-03-27 20:00:04
|
<jouncebot>
|
jdlrobson and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
|
2023-03-27 20:00:31
|
<Jdlrobson>
|
here
|
2023-03-27 20:00:47
|
<kindrobot>
|
I can deploy
|
2023-03-27 20:00:54
|
<Superpes>
|
Hi :)
|
2023-03-27 20:01:11
|
<kindrobot>
|
Jdlrobson: is it safe to deploy your two together?
|
2023-03-27 20:01:51
|
<Jdlrobson>
|
kindrobot: yep
|
2023-03-27 20:01:55
|
<kindrobot>
|
!log start UTC late backport window
|
2023-03-27 20:01:58
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 20:02:56
|
<wikibugs>
|
('PS1) ''Ahmon Dancy: k8s: Use storage-driver instead of storage_driver [puppet] - ''https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803)'
|
2023-03-27 20:04:58
|
<kindrobot>
|
Jdlrobson: what's modern-manpage?
|
2023-03-27 20:05:04
|
<wikibugs>
|
('CR) ''Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: ''JMeybohm)'
|
2023-03-27 20:05:53
|
<kindrobot>
|
Oh, mainpage. I feel silly
|
2023-03-27 20:06:43
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: ''Jdlrobson)'
|
2023-03-27 20:06:46
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
|
2023-03-27 20:07:31
|
<wikibugs>
|
('Merged) ''jenkins-bot: Expand list of wikis with language button at top. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: ''Jdlrobson)'
|
2023-03-27 20:08:50
|
<wikibugs>
|
('PS3) ''Stef Dunlap: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
|
2023-03-27 20:09:03
|
<wikibugs>
|
('CR) ''TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
|
2023-03-27 20:09:47
|
<wikibugs>
|
('Merged) ''jenkins-bot: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: ''Jdlrobson)'
|
2023-03-27 20:10:02
|
<logmsgbot>
|
!log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]]
|
2023-03-27 20:10:09
|
<stashbot>
|
T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093
|
2023-03-27 20:10:10
|
<stashbot>
|
T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777
|
2023-03-27 20:11:29
|
<logmsgbot>
|
!log kindrobot@deploy2002 jdlrobson and kindrobot: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
|
2023-03-27 20:12:08
|
<kindrobot>
|
Jdlrobson: ready to check
|
2023-03-27 20:13:18
|
<Jdlrobson>
|
looking :)
|
2023-03-27 20:14:01
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1017.eqiad.wmnet
|
2023-03-27 20:14:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (13) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:14:33
|
<Jdlrobson>
|
kindrobot: LGTM!
|
2023-03-27 20:15:12
|
<kindrobot>
|
Great, syncing.
|
2023-03-27 20:15:53
|
<kindrobot>
|
Superpes is it safe to deploy your two patches together?
|
2023-03-27 20:16:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:16:44
|
<Superpes>
|
Yep no issue :) kindrobot
|
2023-03-27 20:17:12
|
<wikibugs>
|
('PS1) ''Andrew Bogott: Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - ''https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169)'
|
2023-03-27 20:18:27
|
<wikibugs>
|
('CR) ''Andrew Bogott: [C: ''+2] Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - ''https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169) (owner: ''Andrew Bogott)'
|
2023-03-27 20:19:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (15) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:19:23
|
<wikibugs>
|
('CR) ''Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: ''Subramanya Sastry)'
|
2023-03-27 20:20:52
|
<logmsgbot>
|
!log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] (duration: 10m 50s)
|
2023-03-27 20:21:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (28) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:21:02
|
<stashbot>
|
T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093
|
2023-03-27 20:21:02
|
<stashbot>
|
T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777
|
2023-03-27 20:21:25
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.dns.netbox
|
2023-03-27 20:22:36
|
<kindrobot>
|
Amir1: should we be worried about these systemd units failing before proceeding with the backports?
|
2023-03-27 20:23:20
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
|
2023-03-27 20:24:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:25:01
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
|
2023-03-27 20:25:01
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
2023-03-27 20:25:02
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1017.eqiad.wmnet
|
2023-03-27 20:25:20
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1021.eqiad.wmnet
|
2023-03-27 20:26:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:27:15
|
<Amir1>
|
kindrobot: which one is it?
|
2023-03-27 20:27:37
|
<Jdlrobson>
|
thanks kindrobot ! looking good on production!
|
2023-03-27 20:27:54
|
<Amir1>
|
the alert2001 one, it should be fine for now
|
2023-03-27 20:28:36
|
<kindrobot>
|
(SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 | (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100
|
2023-03-27 20:28:57
|
<taavi>
|
it's a new alert, I suspect it's actually been failing for longer
|
2023-03-27 20:29:07
|
<Amir1>
|
there are way too many systemd unit failures, sigh https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:29:19
|
<Amir1>
|
it's been ongoing for a while it seems
|
2023-03-27 20:29:44
|
<Amir1>
|
cwhite: maybe you know what's going on? especially on alert2001
|
2023-03-27 20:30:00
|
<kindrobot>
|
So would you advise continuing with the backports?
|
2023-03-27 20:31:04
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.dns.netbox
|
2023-03-27 20:31:28
|
<taavi>
|
kindrobot: I'd ignore the alerts and continue
|
2023-03-27 20:31:40
|
<kindrobot>
|
OK, thank you. :)
|
2023-03-27 20:31:46
|
<Amir1>
|
yeah, it's not related for sure
|
2023-03-27 20:32:51
|
<wikibugs>
|
('CR) ''EoghanGaffney: [C: ''+2] Assign insetup role to new aphlict vm [puppet] - ''https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: ''EoghanGaffney)'
|
2023-03-27 20:33:16
|
<wikibugs>
|
('PS4) ''Stef Dunlap: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: ''Superpes15)'
|
2023-03-27 20:33:16
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
|
2023-03-27 20:34:31
|
<cwhite>
|
Amir1: thanks for the heads up, I'll look into the auto restart failure
|
2023-03-27 20:35:13
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
|
2023-03-27 20:35:13
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
2023-03-27 20:35:14
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1021.eqiad.wmnet
|
2023-03-27 20:35:36
|
<wikibugs>
|
('PS2) ''Stef Dunlap: Disable VisualEditor from talk namespace [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
|
2023-03-27 20:35:42
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
|
2023-03-27 20:35:44
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C: ''+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: ''Superpes15)'
|
2023-03-27 20:36:10
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1022.eqiad.wmnet
|
2023-03-27 20:37:32
|
<wikibugs>
|
('Merged) ''jenkins-bot: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: ''Superpes15)'
|
2023-03-27 20:41:26
|
<wikibugs>
|
('CR) ''JMeybohm: [C: ''+1] "Very true. Sorry for causing trouble!" [puppet] - ''https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) (owner: ''Ahmon Dancy)'
|
2023-03-27 20:41:49
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.dns.netbox
|
2023-03-27 20:43:50
|
<logmsgbot>
|
!log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
|
2023-03-27 20:45:04
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
|
2023-03-27 20:45:04
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
2023-03-27 20:45:05
|
<logmsgbot>
|
!log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1022.eqiad.wmnet
|
2023-03-27 20:45:19
|
<wikibugs>
|
'ops-eqiad, ''cloud-services-team, ''decommission-hardware: decommission cloudvirt1017.eqiad.wmnet and cloudvirt102[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T333169 (''Andrew) a:''Andrew→''Jclark-ctr'
|
2023-03-27 20:46:41
|
<wikibugs>
|
('PS1) ''Jgreen: payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - ''https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892)'
|
2023-03-27 20:50:29
|
<Amir1>
|
cwhite: it might be related but we are getting systemd unit fail on db1101 but the alert doesn't make sense
|
2023-03-27 20:50:47
|
<Amir1>
|
as it's really not failing
|
2023-03-27 20:51:03
|
<Amir1>
|
(maybe? I'll check)
|
2023-03-27 20:51:24
|
<cwhite>
|
From the auto-restart timer?
|
2023-03-27 20:51:25
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (13) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 20:52:08
|
<wikibugs>
|
'SRE, ''Traffic, ''HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (''BCornwall) Hey, @greg. It's not blocking overall improvement, it's just not [[ https://wikitech.wikimedia.org/wiki/HTTPS#Current_policies_and_standards | complying with standards ]]. Since s...'
|
2023-03-27 20:52:36
|
<wikibugs>
|
('CR) ''Jgreen: [C: ''+2] payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - ''https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892) (owner: ''Jgreen)'
|
2023-03-27 20:55:34
|
<cwhite>
|
Amir1: db1101 is not an s7 host anymore?
|
2023-03-27 20:55:54
|
<Amir1>
|
probably Manuel moved it but he is not around
|
2023-03-27 20:56:10
|
<Amir1>
|
I think he said he reset the systemd timer
|
2023-03-27 20:57:42
|
<icinga-wm>
|
PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
|
2023-03-27 20:57:44
|
<icinga-wm>
|
PROBLEM - Host restbase1033 is DOWN: PING CRITICAL - Packet loss = 100%
|
2023-03-27 20:57:56
|
<cwhite>
|
I'm guessing there are auto-restart timers lingering that aren't being cleaned up by puppet.
|
2023-03-27 20:57:58
|
<icinga-wm>
|
PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1,
|
2023-03-27 20:57:58
|
<icinga-wm>
|
th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
|
2023-03-27 20:59:36
|
<marostegui>
|
cwhite: it's not a S7 no
|
2023-03-27 20:59:40
|
<marostegui>
|
it's in M1
|
2023-03-27 21:00:03
|
<marostegui>
|
I disabled both systemd units
|
2023-03-27 21:00:05
|
<jouncebot>
|
Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100).
|
2023-03-27 21:00:42
|
<kindrobot>
|
Note: the backport deploy window is still in progress
|
2023-03-27 21:01:06
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (4) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:01:12
|
<kindrobot>
|
taavi: it seems like it's stalled out. It's cleared CI, but it hasn't merged
|
2023-03-27 21:02:25
|
<kindrobot>
|
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/903326/
|
2023-03-27 21:02:50
|
<icinga-wm>
|
RECOVERY - Host restbase1033 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
|
2023-03-27 21:03:46
|
<icinga-wm>
|
PROBLEM - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
|
2023-03-27 21:04:38
|
<icinga-wm>
|
PROBLEM - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
|
2023-03-27 21:07:22
|
<kindrobot>
|
dancy: ^
|
2023-03-27 21:11:02
|
<tzatziki>
|
!log moving Universal Code of Conduct/Enforcement guidelines -> Universal Code of Conduct/Enforcement guidelines/Version 1 on metawiki with `extensions/Translate/scripts/moveTranslatableBundle.php`
|
2023-03-27 21:11:05
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 21:11:17
|
<taavi>
|
kindrobot: somehow the +2 was applied to PS1 while PS2 was the latest
|
2023-03-27 21:11:19
|
<tzatziki>
|
(probably don't need to log that but just in case)
|
2023-03-27 21:11:25
|
<Superpes>
|
Uh it doesn't want to merge it
|
2023-03-27 21:11:51
|
<Superpes>
|
Oh
|
2023-03-27 21:12:17
|
<thcipriani>
|
hrm
|
2023-03-27 21:12:33
|
<taavi>
|
so just re-+2 it and probably file a bug in scap
|
2023-03-27 21:12:39
|
<kindrobot>
|
It should probably be OK to scap backport again, eh?
|
2023-03-27 21:12:42
|
<kindrobot>
|
OK.
|
2023-03-27 21:12:43
|
<wikibugs>
|
('CR) ''Andrea Denisse: [V: ''+1 C: ''+2] doc: Reserve UID/GID for the doc-uploader system user [puppet] - ''https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: ''Andrea Denisse)'
|
2023-03-27 21:12:48
|
<thcipriani>
|
+1
|
2023-03-27 21:13:10
|
<kindrobot>
|
Thank you all.
|
2023-03-27 21:13:17
|
<wikibugs>
|
('CR) ''TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
|
2023-03-27 21:14:03
|
<jinxer-wm>
|
(ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
2023-03-27 21:14:12
|
<wikibugs>
|
('Merged) ''jenkins-bot: Disable VisualEditor from talk namespace [mediawiki-config] - ''https://gerrit.wikimedia.org/r/903326 (owner: ''Superpes15)'
|
2023-03-27 21:14:25
|
<logmsgbot>
|
!log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]]
|
2023-03-27 21:14:31
|
<stashbot>
|
T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279
|
2023-03-27 21:14:54
|
<logmsgbot>
|
!log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided)
|
2023-03-27 21:15:08
|
<logmsgbot>
|
!log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided) (duration: 00m 13s)
|
2023-03-27 21:15:49
|
<logmsgbot>
|
!log kindrobot@deploy2002 kindrobot and superpes: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
|
2023-03-27 21:16:23
|
<kindrobot>
|
Ready to check Superpes
|
2023-03-27 21:17:02
|
<Superpes>
|
Checked both and everything is fine kindrobot! Thanks! :)
|
2023-03-27 21:17:22
|
<kindrobot>
|
Thanks, syncing
|
2023-03-27 21:18:28
|
<icinga-wm>
|
RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
|
2023-03-27 21:18:47
|
<wikibugs>
|
('CR) ''Dduvall: buildkitd: Isolate build container user/process/network namespaces (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
|
2023-03-27 21:19:03
|
<jinxer-wm>
|
(ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
2023-03-27 21:22:31
|
<wikibugs>
|
('CR) ''Dduvall: buildkitd: Isolate build container user/process/network namespaces (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: ''Dduvall)'
|
2023-03-27 21:22:51
|
<logmsgbot>
|
!log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] (duration: 08m 26s)
|
2023-03-27 21:22:57
|
<stashbot>
|
T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279
|
2023-03-27 21:23:40
|
<kindrobot>
|
Sync finished. Thanks everyone.
|
2023-03-27 21:23:52
|
<kindrobot>
|
!log finish UTC late backports
|
2023-03-27 21:23:56
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 21:24:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:24:16
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (30) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:24:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 21:24:30
|
<icinga-wm>
|
RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
|
2023-03-27 21:24:56
|
<Amir1>
|
!log start of watchlist clean up in arwiki (T328501)
|
2023-03-27 21:24:59
|
<kindrobot>
|
Reedy, sbassett, Maryum, and manfredi backport window finished :)
|
2023-03-27 21:25:00
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 21:25:01
|
<stashbot>
|
T328501: Request to clean my watchlist from articles in namespace 0 and 1 - https://phabricator.wikimedia.org/T328501
|
2023-03-27 21:25:15
|
<Superpes>
|
Thanks for your time kindrobot :D
|
2023-03-27 21:25:53
|
<kindrobot>
|
No problem, thank you. :)
|
2023-03-27 21:26:16
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:27:38
|
<icinga-wm>
|
PROBLEM - Restbase root url on restbase1033 is CRITICAL: connect to address 10.64.48.71 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
|
2023-03-27 21:29:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:29:24
|
<icinga-wm>
|
PROBLEM - SSH on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
|
2023-03-27 21:30:56
|
<icinga-wm>
|
PROBLEM - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
|
2023-03-27 21:35:17
|
<wikibugs>
|
'SRE, ''vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (''andrea.denisse)'
|
2023-03-27 21:37:59
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (''Ladsgroup) superset should be automatically done via wmf ldap group. If Jgiannelos is in the ldap group, it should be done already. Correct?'
|
2023-03-27 21:39:35
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (''Ladsgroup) a:''Ladsgroup I'm on clinic duty this week. Waiting for signoff by Tyler. Maybe a deployment training can be arranged (or other devs in wmde can do an i...'
|
2023-03-27 21:40:02
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (''Ladsgroup) https://wikitech.wikimedia.org/wiki/Deployments/Training'
|
2023-03-27 21:42:50
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (''taavi) > To be able to access deployed Wiki instances and ensure that wikibase (namely wikibase client) is properly installed and configured Unless you're also planni...'
|
2023-03-27 21:43:53
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (''Ottomata) Superset has its own 'roles', and I think something changed in a recent version that makes it so the default role doesn't have access to the SQL lab feat...'
|
2023-03-27 21:45:34
|
<ryankemper>
|
!log T330165 Depooled relevant search platform hosts: `sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'`
|
2023-03-27 21:45:39
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 21:45:40
|
<stashbot>
|
T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
|
2023-03-27 21:49:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (13) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:56:01
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (40) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 21:58:13
|
<urandom>
|
!log power cycling restbase1033 — T333243
|
2023-03-27 21:58:17
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 21:58:18
|
<stashbot>
|
T333243: restbase1033 is down - https://phabricator.wikimedia.org/T333243
|
2023-03-27 21:58:41
|
<maryum>
|
!log Deploy security fix for T326952
|
2023-03-27 21:58:45
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 21:59:11
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 22:01:08
|
<icinga-wm>
|
PROBLEM - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
|
2023-03-27 22:01:16
|
<icinga-wm>
|
PROBLEM - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
|
2023-03-27 22:01:16
|
<icinga-wm>
|
PROBLEM - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
|
2023-03-27 22:01:34
|
<icinga-wm>
|
RECOVERY - SSH on restbase1033 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
|
2023-03-27 22:01:38
|
<icinga-wm>
|
RECOVERY - Restbase root url on restbase1033 is OK: HTTP OK: HTTP/1.1 200 - 17255 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/RESTBase
|
2023-03-27 22:02:08
|
<icinga-wm>
|
PROBLEM - cassandra-b service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
2023-03-27 22:02:24
|
<icinga-wm>
|
PROBLEM - cassandra-c service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
2023-03-27 22:02:24
|
<icinga-wm>
|
PROBLEM - cassandra-a service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
2023-03-27 22:04:04
|
<icinga-wm>
|
RECOVERY - cassandra-b service on restbase1033 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
2023-03-27 22:04:18
|
<icinga-wm>
|
RECOVERY - cassandra-c service on restbase1033 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
2023-03-27 22:04:18
|
<icinga-wm>
|
RECOVERY - cassandra-a service on restbase1033 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
2023-03-27 22:04:46
|
<wikibugs>
|
('PS2) ''EoghanGaffney: Adds php and apache logs for doc machines [puppet] - ''https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245)'
|
2023-03-27 22:04:56
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 22:05:18
|
<jinxer-wm>
|
(MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
2023-03-27 22:06:16
|
<wikibugs>
|
('CR) ''EoghanGaffney: Adds php and apache logs for doc machines (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: ''EoghanGaffney)'
|
2023-03-27 22:06:54
|
<icinga-wm>
|
RECOVERY - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-b valid until 2024-08-28 11:43:21 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
|
2023-03-27 22:07:00
|
<icinga-wm>
|
RECOVERY - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-c valid until 2024-08-28 11:43:23 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
|
2023-03-27 22:07:01
|
<icinga-wm>
|
RECOVERY - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-a valid until 2024-08-28 11:43:18 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
|
2023-03-27 22:07:02
|
<icinga-wm>
|
RECOVERY - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.151 port 9042 https://phabricator.wikimedia.org/T93886
|
2023-03-27 22:07:04
|
<icinga-wm>
|
RECOVERY - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.152 port 9042 https://phabricator.wikimedia.org/T93886
|
2023-03-27 22:07:22
|
<icinga-wm>
|
RECOVERY - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.153 port 9042 https://phabricator.wikimedia.org/T93886
|
2023-03-27 22:09:50
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''Volans) >>! In T330165#8731601, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/YxgIJY...'
|
2023-03-27 22:10:17
|
<jinxer-wm>
|
(MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
2023-03-27 22:14:43
|
<wikibugs>
|
('CR) ''Dzahn: "It's not true that this removes IRC notifications, they were just sent to a test channel only. I am fixing that here: https://gerrit.wikim"; [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 22:16:08
|
<zabe>
|
!log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Meta:WMF Support and Safety" "Meta:WMF Trust and Safety" "Zabe" --reason "per [[:phab:T330514|T330514]]" # T330514
|
2023-03-27 22:16:13
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 22:16:15
|
<stashbot>
|
T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
|
2023-03-27 22:17:18
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - ''https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 22:18:06
|
<jinxer-wm>
|
(CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 287.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
|
2023-03-27 22:19:47
|
<wikibugs>
|
('PS1) ''Zabe: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514)'
|
2023-03-27 22:21:42
|
<zabe>
|
MessageIndexException from line 191 of /srv/mediawiki/php-1.41.0-wmf.1/extensions/Translate/utils/MessageIndex.php: MessageIndex: unable to acquire lock
|
2023-03-27 22:21:46
|
<zabe>
|
:|
|
2023-03-27 22:22:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 22:22:27
|
<wikibugs>
|
'SRE, ''Infrastructure-Foundations, ''fundraising-tech-ops, ''netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (''Dwisehaupt) Thanks! Verified working and runs good.'
|
2023-03-27 22:22:54
|
<herzog>
|
zabe: I don't think we need to backport that to the current wmf branch, we can let the train do it when the time comes?
|
2023-03-27 22:23:34
|
<herzog>
|
oh, well, you're moving the meta pages now - I wanted to wait a bit
|
2023-03-27 22:24:05
|
<herzog>
|
well, we get this done now, good :)
|
2023-03-27 22:24:15
|
<zabe>
|
is there anything specific you wanted to wait for?
|
2023-03-27 22:24:55
|
<mutante>
|
runs puppet on bast1003 because that alert claims puppet fails on bastion "cluster" but also I don't get the graph :)
|
2023-03-27 22:27:02
|
<mutante>
|
and nothing actually failed there.. so no idea
|
2023-03-27 22:28:08
|
<mutante>
|
ah, it's bast5003 pushing things over the limit and the usual background ones https://puppetboard.wikimedia.org/nodes?status=failed
|
2023-03-27 22:29:15
|
<herzog>
|
zabe: my idea was train -> watch for failures -> rename; but since you are backporting it now, I guess there's no need to wait :)
|
2023-03-27 22:30:54
|
<zabe>
|
well :)
|
2023-03-27 22:31:10
|
<zabe>
|
jouncebot: nowandnext
|
2023-03-27 22:31:10
|
<jouncebot>
|
For the next 0 hour(s) and 28 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100)
|
2023-03-27 22:31:10
|
<jouncebot>
|
In 3 hour(s) and 28 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0200)
|
2023-03-27 22:31:29
|
<wikibugs>
|
('CR) ''Zabe: [C: ''+2] Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: ''Zabe)'
|
2023-03-27 22:42:06
|
<jinxer-wm>
|
(CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
|
2023-03-27 22:43:32
|
<mutante>
|
!log apt2001 - kill 3105; run puppet
|
2023-03-27 22:43:36
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 22:43:45
|
<mutante>
|
!log stat1004 - kill 29291; run puppet
|
2023-03-27 22:43:48
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 22:46:51
|
<wikibugs>
|
('Merged) ''jenkins-bot: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - ''https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: ''Zabe)'
|
2023-03-27 22:47:06
|
<jinxer-wm>
|
(CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
|
2023-03-27 22:47:20
|
<logmsgbot>
|
!log zabe@deploy2002 Started scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]]
|
2023-03-27 22:47:26
|
<stashbot>
|
T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
|
2023-03-27 22:48:34
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 22:48:48
|
<mutante>
|
!log stat1005 - kill 18179; run puppet ; stat1007 - kill 3346; run puppet ; stat1006 - kill 23887 run puppet
|
2023-03-27 22:48:52
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 22:52:06
|
<jinxer-wm>
|
(CirrusSearchNodeIndexingNotIncreasing) firing: (6) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
|
2023-03-27 22:57:06
|
<jinxer-wm>
|
(CirrusSearchNodeIndexingNotIncreasing) firing: (8) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
|
2023-03-27 23:00:01
|
<logmsgbot>
|
!log zabe@deploy2002 zabe: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
|
2023-03-27 23:00:11
|
<stashbot>
|
T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
|
2023-03-27 23:02:06
|
<jinxer-wm>
|
(CirrusSearchNodeIndexingNotIncreasing) firing: (9) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
|
2023-03-27 23:02:11
|
<wikibugs>
|
'SRE, ''Infrastructure Security, ''Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (''Dzahn) We got the "widespread puppet failures" alert which made me look at some random failed hosts in the list. I found the reason was this offboarding, because: apt2001: ` Err...'
|
2023-03-27 23:02:58
|
<jinxer-wm>
|
(KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 23:03:34
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (15) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 23:07:58
|
<jinxer-wm>
|
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
|
2023-03-27 23:08:34
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (22) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 23:08:48
|
<logmsgbot>
|
!log zabe@deploy2002 Finished scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] (duration: 21m 27s)
|
2023-03-27 23:08:54
|
<stashbot>
|
T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
|
2023-03-27 23:09:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 23:10:11
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] peopleweb: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 23:13:34
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (60) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 23:15:26
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 it's also not true anymore that this removes IRC notifications. they sho" [puppet] - ''https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: ''Dzahn)'
|
2023-03-27 23:17:29
|
<wikibugs>
|
'SRE, ''DBA, ''Data Pipelines, ''Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (''colewhite)'
|
2023-03-27 23:18:15
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] etherpad: replace Icinga with Prometheus monitoring [puppet] - ''https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 23:21:49
|
<wikibugs>
|
'SRE, ''ops-eqiad, ''Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (''wiki_willy) a:''Jclark-ctr'
|
2023-03-27 23:22:06
|
<jinxer-wm>
|
(CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
|
2023-03-27 23:22:35
|
<wikibugs>
|
'SRE, ''SRE-swift-storage, ''ops-eqiad, ''Analytics-Radar, ''DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (''wiki_willy) a:''Jclark-ctr'
|
2023-03-27 23:24:23
|
<wikibugs>
|
'SRE, ''SRE-swift-storage, ''ops-codfw, ''DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (''wiki_willy) a:''Papaul'
|
2023-03-27 23:24:26
|
<jinxer-wm>
|
(WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
|
2023-03-27 23:24:39
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (''Htriedman) @MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as `analytics-platform-eng` on stat machines by using `sudo -u analytics-platform-eng <cmd>...` and am b...'
|
2023-03-27 23:25:24
|
<wikibugs>
|
'ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (''wiki_willy) a:''Jhancock.wm'
|
2023-03-27 23:29:25
|
<wikibugs>
|
'SRE, ''SRE-swift-storage, ''ops-codfw, ''DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (''wiki_willy) Hi guys - can we confirm the firmware is all up to date? Thanks, Willy'
|
2023-03-27 23:31:12
|
<zabe>
|
!log deployed patch for T330968
|
2023-03-27 23:31:16
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 23:33:34
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (58) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 23:38:34
|
<jinxer-wm>
|
(SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
2023-03-27 23:42:22
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (''Dzahn) Hi @Htriedman and @MoritzMuehlenhoff, the answer to this riddle is that while the special user "`analytics-platform-eng`" exists on all stat* machines, the admin group `analytics-platform-eng-admin...'
|
2023-03-27 23:44:30
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*people.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_inpu"; [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|
2023-03-27 23:44:42
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (''Ottomata) > @Htriedman I think this comes down to a new access request like "add analytics-platform-eng-admins on stat* hosts". Or ssh to an-airflow1004 and run your sudo cmd there :)'
|
2023-03-27 23:47:08
|
<mutante>
|
!log people1003 - taking down apache to provoke monitoring alert (inactive instances) and confirm IRC alerting change works
|
2023-03-27 23:47:11
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
2023-03-27 23:50:06
|
<mutante>
|
jinxer-wm: jinx it
|
2023-03-27 23:50:55
|
<jinxer-wm>
|
(ProbeDown) firing: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
2023-03-27 23:51:02
|
<icinga-wm>
|
PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
|
2023-03-27 23:51:23
|
<mutante>
|
oh, well, that worked but the Icinga part isn't gone
|
2023-03-27 23:51:31
|
<mutante>
|
it was supposed to replace that
|
2023-03-27 23:52:42
|
<icinga-wm>
|
RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
|
2023-03-27 23:55:50
|
<jinxer-wm>
|
(ProbeDown) resolved: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
2023-03-27 23:59:07
|
<wikibugs>
|
('CR) ''Dzahn: [C: ''+2] "confirmed this reports on IRC on both channels and also created a ticket, as desired" [puppet] - ''https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: ''Dzahn)'
|