[Help] telegraf and monitor fail after upgrading v3.11.3=>v3.11.9 #21957

Open
chenjacken opened this issue Jan 11, 2025 · 5 comments
Labels
question Further information is requested state/awaiting processing

chenjacken commented Jan 11, 2025

After upgrading from v3.11.3 to v3.11.9, telegraf and monitor are failing.

telegraf log:

[root@master1 ocboot]# kubectl -n onecloud get onecloudclusters default -o=jsonpath='{.spec.version}'
v3.11.9


[root@master1 ocboot]# kubectl logs default-telegraf-t777f -n onecloud
[info 250112 01:36:04 all.init.0(all.go:222)] init onecloud executor client, socket path: /hostfs/run/onecloud/exec.sock
2025-01-11T17:36:04Z I! Starting Telegraf 
2025-01-11T17:36:04Z E! [telegraf] Error running agent: Error: no outputs found, did you provide a valid config file?
[root@master1 ocboot]# 

monitor log:

[info 2025-01-11 17:43:12 cloudcommon.InitDB(database.go:122)] using inmemory lockman
[info 2025-01-11 17:43:12 db.CheckSync(models.go:116)] Start check database schema: autoSync(true), enableChecksumTables(false), skipInitChecksum(false)
[warning 2025-01-11 17:43:12 db.CheckSync(models.go:155)] table __default__-alerts_tbl-enabled-created_at-updated_at-update_version-deleted_at-deleted-id-description-is_emulated-name-status-progress-domain_id-tenant_id-frequency-settings-level-message-used_by-execution_error-for-eval_data-state-no_data_state-execution_error_state-last_state_change-state_changes-customize_config-res_type has been synced!
[warning 2025-01-11 17:43:12 db.CheckSync(models.go:155)] table __default__-alerts_tbl-enabled-created_at-updated_at-update_version-deleted_at-deleted-id-description-is_emulated-name-status-progress-domain_id-tenant_id-frequency-settings-level-message-used_by-execution_error-for-eval_data-state-no_data_state-execution_error_state-last_state_change-state_changes-customize_config-res_type has been synced!
[warning 2025-01-11 17:43:12 db.CheckSync(models.go:155)] table __default__-alerts_tbl-enabled-created_at-updated_at-update_version-deleted_at-deleted-id-description-is_emulated-name-status-progress-domain_id-tenant_id-frequency-settings-level-message-used_by-execution_error-for-eval_data-state-no_data_state-execution_error_state-last_state_change-state_changes-customize_config-res_type has been synced!
[info 2025-01-11 17:43:12 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 2025-01-11 17:43:12 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[info 2025-01-11 17:43:12 db.setDbConnection(database.go:60)] Total 27 db workers, set db connection max
[info 2025-01-11 17:43:12 service.startServices(service.go:113)] Initializing dataSourceManager
goroutine 170 [running]:
runtime/debug.Stack()
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x13
yunion.io/x/log.Fatalf({0x235bad3, 0x1a}, {0xc00125bfa8, 0x2, 0x2})
        /root/go/src/yunion.io/x/onecloud/vendor/yunion.io/x/log/log.go:138 +0x2c
yunion.io/x/onecloud/pkg/monitor/service.startServices()
        /root/go/src/yunion.io/x/onecloud/pkg/monitor/service/service.go:115 +0x19f
created by yunion.io/x/onecloud/pkg/monitor/service.StartService in goroutine 1
        /root/go/src/yunion.io/x/onecloud/pkg/monitor/service/service.go:77 +0x134
[info 2025-01-11 17:43:12 worker.(*Worker).Start(worker.go:66)] start to get api Resource
[fatal 2025-01-11 17:43:12 service.startServices(service.go:115)] Service dataSourceManager init failed: get default TSDB source: [get internal service type "influxdb": catalog.GetServiceURLs: No such service influxdb: NotFoundError, get internal service type "victoria-metrics": catalog.GetServiceURLs: No such service victoria-metrics: NotFoundError]
[root@master1 ocboot]# 
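
The fatal line above names every service type that monitor could not resolve in the service catalog. A minimal sketch, assuming the log format shown here and a grep that supports -o, for pulling those names out of a monitor log (missing_services is a hypothetical helper name, not part of Cloudpods):

```shell
# Hypothetical helper: read monitor log text on stdin and print each service
# type reported as "No such service <type>", one per line, de-duplicated.
missing_services() {
  grep -o 'No such service [a-z0-9-]*' | awk '{ print $4 }' | sort -u
}

# Usage against a live cluster (not executed here):
#   kubectl logs <monitor-pod> -n onecloud | missing_services
```

For the log in this report it would list influxdb and victoria-metrics, meaning neither TSDB backend was found through the internal interface.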

victoria-metrics pod:

[root@master1 ~]# kubectl get pods -n onecloud |grep victoria-metrics
default-victoria-metrics-5d6b86fc9d-snjvs            1/1     Running            0          121m
[root@master1 ~]# 

[root@master1 ~]# kubectl logs -n onecloud $(kubectl get pods -n onecloud | grep monitor | awk '{print $1}') | grep 'TSDB data source'
[root@master1 ~]# 


[root@master1 ~]# climc endpoint-list --search victoria-metrics --details
+----------------------------------+-----------+----------------------------------+------------------+------------------+----------------------------------------+-----------+---------+
|                ID                | Region_ID |            Service_ID            |   Service_Name   |   Service_Type   |                  URL                   | Interface | Enabled |
+----------------------------------+-----------+----------------------------------+------------------+------------------+----------------------------------------+-----------+---------+
| d950494f9db547998937167798e4306f | region0   | 4374e8a091d5448b8c1c44d44cb4644d | victoria-metrics | victoria-metrics | https://172.16.1.200:30428             | public    | true    |
| af956cdacf0547168a558bbddc3f23ea | region0   | 4374e8a091d5448b8c1c44d44cb4644d | victoria-metrics | victoria-metrics | https://default-victoria-metrics:30428 | internal  | true    |
+----------------------------------+-----------+----------------------------------+------------------+------------------+----------------------------------------+-----------+---------+
***  Total: 2 Pages: 1 Limit: 20 Offset: 0 Page: 1  ***
[root@master1 ~]# 

Could you please take a look and advise how to fix this? Thanks!

chenjacken added the question label Jan 11, 2025
chenjacken changed the title from "[Help] telegraf and monitor fail after upgrading v3.11.3=>v3.11.0" to "[Help] telegraf and monitor fail after upgrading v3.11.3=>v3.11.9" Jan 11, 2025
zexi (Member) commented Jan 13, 2025

Please troubleshoot by following this document: https://www.cloudpods.org/docs/operations/monitoring/migrating-to-vm. Have these services been restarted?
kubectl get deployment -n onecloud | egrep 'region|monitor|meter|cloudmon|suggestion' | awk '{print $1}' | xargs kubectl rollout restart deployment -n onecloud
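
After a rollout restart it can help to confirm that each restarted deployment actually became ready. A sketch, assuming the same name filter as the command above; select_deployments is a hypothetical helper, and the 120s timeout in the commented usage line is an arbitrary choice:

```shell
# Hypothetical helper: read `kubectl get deployment` output on stdin and print
# the names of deployments matched by the filter used in the restart command.
select_deployments() {
  grep -E 'region|monitor|meter|cloudmon|suggestion' | awk '{ print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get deployment -n onecloud | select_deployments |
#     xargs -I{} kubectl rollout status deployment/{} -n onecloud --timeout=120s
```

kubectl rollout status exits non-zero when a deployment does not finish rolling out within the timeout, so a pod stuck in CrashLoopBackOff shows up as a failed check.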

chenjacken (Author) commented Jan 13, 2025

> Please troubleshoot by following this document: https://www.cloudpods.org/docs/operations/monitoring/migrating-to-vm. Have these services been restarted?
> kubectl get deployment -n onecloud | egrep 'region|monitor|meter|cloudmon|suggestion' | awk '{print $1}' | xargs kubectl rollout restart deployment -n onecloud

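I have read this document several times and checked my setup against it. I just restarted the corresponding services and the problem is unchanged: the logs are the same, and monitor still does not connect to victoria-metrics.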

victoria-metrics service:

[root@master1 ~]# climc endpoint-list --search victoria-metrics --details
+----------------------------------+-----------+----------------------------------+------------------+------------------+----------------------------------------+-----------+---------+
|                ID                | Region_ID |            Service_ID            |   Service_Name   |   Service_Type   |                  URL                   | Interface | Enabled |
+----------------------------------+-----------+----------------------------------+------------------+------------------+----------------------------------------+-----------+---------+
| d950494f9db547998937167798e4306f | region0   | 4374e8a091d5448b8c1c44d44cb4644d | victoria-metrics | victoria-metrics | https://172.16.1.200:30428             | public    | true    |
| af956cdacf0547168a558bbddc3f23ea | region0   | 4374e8a091d5448b8c1c44d44cb4644d | victoria-metrics | victoria-metrics | https://default-victoria-metrics:30428 | internal  | true    |
+----------------------------------+-----------+----------------------------------+------------------+------------------+----------------------------------------+-----------+---------+
***  Total: 2 Pages: 1 Limit: 20 Offset: 0 Page: 1  ***
[root@master1 ~]# climc endpoint-list --search influxdb --details
***  Total: 0  ***
[root@master1 ~]# 


[root@master1 ~]# kubectl logs default-victoria-metrics-5d6b86fc9d-snjvs -n onecloud
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:12","msg":"build version: victoria-metrics-20231116-194416-tags-v1.95.1-0-g354563393"}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:13","msg":"command-line flags"}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -envflag.enable=\"true\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -envflag.prefix=\"VM_\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -httpListenAddr=\":30428\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -influx.databaseNames=\"telegraf,meter_db,monitor,system,mysql_metrics\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -loggerFormat=\"json\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -maxLabelsPerTimeseries=\"60\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -retentionPeriod=\"93d\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -storageDataPath=\"/storage\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -tls=\"true\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -tlsCertFile=\"/etc/yunion/pki/service.crt\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"  -tlsKeyFile=\"secret\""}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/app/victoria-metrics/main.go:70","msg":"starting VictoriaMetrics at \":30428\"..."}
{"ts":"2025-01-11T15:43:12.681Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:108","msg":"opening storage at \"/storage\" with -retentionPeriod=93d"}
{"ts":"2025-01-11T15:43:12.686Z","level":"info","caller":"VictoriaMetrics/lib/memory/memory.go:42","msg":"limiting caches to 10570904371 bytes, leaving 7047269581 bytes to the OS according to -memory.allowedPercent=60"}
{"ts":"2025-01-11T15:43:13.099Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:122","msg":"successfully opened storage \"/storage\" in 0.418 seconds; partsCount: 137; blocksCount: 3821879; rowsCount: 3969858126; sizeBytes: 5389479223"}
{"ts":"2025-01-11T15:43:13.102Z","level":"info","caller":"VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:115","msg":"loading rollupResult cache from \"/storage/cache/rollupResult\"..."}
{"ts":"2025-01-11T15:43:13.105Z","level":"info","caller":"VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:143","msg":"loaded rollupResult cache from \"/storage/cache/rollupResult\" in 0.003 seconds; entriesCount: 0, sizeBytes: 0"}
{"ts":"2025-01-11T15:43:13.105Z","level":"info","caller":"VictoriaMetrics/app/victoria-metrics/main.go:80","msg":"started VictoriaMetrics in 0.424 seconds"}
{"ts":"2025-01-11T15:43:13.105Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:101","msg":"starting http server at https://127.0.0.1:30428/"}
{"ts":"2025-01-11T15:43:13.105Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:102","msg":"pprof handlers are exposed at https://127.0.0.1:30428/debug/pprof/"}
[root@master1 ~]# 


[root@master1 ~]# kubectl logs default-telegraf-wpg55 -n onecloud
[info 250114 00:03:13 all.init.0(all.go:222)] init onecloud executor client, socket path: /hostfs/run/onecloud/exec.sock
2025-01-13T16:03:13Z I! Starting Telegraf 
2025-01-13T16:03:13Z E! [telegraf] Error running agent: Error: no outputs found, did you provide a valid config file?
[root@master1 ~]# 

default-monitor log:

_state_change-state_changes-customize_config-res_type has been synced!
[info 2025-01-13 15:51:01 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 2025-01-13 15:51:01 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[warning 2025-01-13 15:51:01 db.CheckSync(models.go:155)] table __default__-alerts_tbl-enabled-created_at-updated_at-update_version-deleted_at-deleted-id-description-is_emulated-name-status-progress-domain_id-tenant_id-frequency-settings-level-message-used_by-execution_error-for-eval_data-state-no_data_state-execution_error_state-last_state_change-state_changes-customize_config-res_type has been synced!
[info 2025-01-13 15:51:01 db.setDbConnection(database.go:60)] Total 27 db workers, set db connection max
[info 2025-01-13 15:51:01 service.startServices(service.go:113)] Initializing dataSourceManager
goroutine 33 [running]:
runtime/debug.Stack()
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x13
yunion.io/x/log.Fatalf({0x235bad3, 0x1a}, {0xc000fd7fa8, 0x2, 0x2})
        /root/go/src/yunion.io/x/onecloud/vendor/yunion.io/x/log/log.go:138 +0x2c
yunion.io/x/onecloud/pkg/monitor/service.startServices()
        /root/go/src/yunion.io/x/onecloud/pkg/monitor/service/service.go:115 +0x19f
created by yunion.io/x/onecloud/pkg/monitor/service.StartService in goroutine 1
        /root/go/src/yunion.io/x/onecloud/pkg/monitor/service/service.go:77 +0x134
[fatal 2025-01-13 15:51:01 service.startServices(service.go:115)] Service dataSourceManager init failed: get default TSDB source: [get internal service type "influxdb": catalog.GetServiceURLs: No such service influxdb: NotFoundError, get internal service type "victoria-metrics": catalog.GetServiceURLs: No such service victoria-metrics: NotFoundError]
[root@master1 ~]# 

Is it expected for one cluster to run two monitor pods at the same time? In another cluster I see only one default-monitor pod.

[root@master1 ~]# kubectl get pods -n onecloud -owide |grep monitor
default-monitor-6b98f49fcc-tr944                     0/1     CrashLoopBackOff   4          2m10s   10.40.180.52    master2   <none>           <none>
default-monitor-6c96c95846-jz2sj                     0/1     CrashLoopBackOff   4          2m10s   10.40.180.39    master2   <none>           <none>
[root@master1 ~]# 
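
The two pod names above differ in their middle segment (6b98f49fcc and 6c96c95846). Assuming the usual Deployment pod naming of <deployment>-<replicaset-hash>-<suffix>, that would mean the pods belong to two different ReplicaSets. One possible explanation is a rolling restart in which the new pod never becomes ready, so the old one is not removed. A minimal sketch for listing the distinct hashes (rs_hashes is a hypothetical helper):

```shell
# Hypothetical helper: read `kubectl get pods` lines on stdin and print the
# distinct second-to-last dash-separated segment of each pod name.
rs_hashes() {
  awk '{ n = split($1, parts, "-"); if (n >= 3) print parts[n - 1] }' | sort -u
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -n onecloud | grep monitor | rs_hashes
```

More than one line of output suggests that pods from more than one ReplicaSet are present.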

zexi (Member) commented Jan 14, 2025

Have the region settings in the configuration files been modified? The error says the victoria-metrics endpoint was not found, yet the earlier climc endpoint-list output shows victoria-metrics endpoints under region0.
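
One way to follow up on this is to print the region and URL of the internal victoria-metrics endpoint and compare the region with the one the services are configured for. A sketch, assuming the table layout printed by climc endpoint-list --details earlier in this thread; filter_endpoints is a hypothetical helper:

```shell
# Hypothetical helper: read a `climc endpoint-list --details` table on stdin
# and print "<Region_ID> <URL>" for rows with the given service type and
# interface. Columns are split on "|" and trimmed of surrounding spaces.
filter_endpoints() {
  svc_type="$1"
  iface="$2"
  awk -F'|' -v t="$svc_type" -v i="$iface" '
    NF >= 9 {
      for (n = 1; n <= NF; n++) gsub(/^[ \t]+|[ \t]+$/, "", $n)
      # $3 = Region_ID, $6 = Service_Type, $7 = URL, $8 = Interface
      if ($6 == t && $8 == i) print $3, $7
    }'
}

# Usage against a live cluster (not executed here):
#   climc endpoint-list --search victoria-metrics --details |
#     filter_endpoints victoria-metrics internal
```

If the region printed here differs from the region the monitor service uses, the catalog lookup for the internal endpoint could fail even though the endpoint exists.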

chenjacken (Author) commented

As far as I remember, I have not really changed it. The output of climc service-config-edit region2 is:

default:
  auth_token_cache_size: 2048
  auto_reconcile_backup_servers: false
  auto_snapshot_day: 1
  auto_snapshot_hour: 2
  baremetal_prepare_package_url: https://172.16.1.200/baremetal-prepare/baremetal_prepare.tar.gz
  baremetal_server_reuse_host_ip: true
  calculate_quota_usage_interval_seconds: 900
  check_health_interval: 1
  check_scale_interval: 60
  clean_useless_kvm_security_group: false
  cloud_account_batch_sync_size: 10
  cloud_auto_sync_interval_seconds: 30
  cloud_images_sync_interval_hours: 3
  cloud_provider_sync_worker_count: 10
  cloud_sync_worker_count: 5
  cloudaccount_health_status_check: true
  concurrent_upper: 500
  convert_kubelet_docker_volume_size: 256g
  cron_job_worker_count: 4
  debug_client: false
  default_bandwidth: 1000
  default_bucket_quota: 100
  default_bw_quota: 2000000
  default_cache_quota: 10
  default_cloudaccount_quota: 20
  default_cpu_overcommit_bound: 8
  default_cpu_quota: 200
  default_disk_cache_mode: none
  default_disk_driver: scsi
  default_disk_size: 10240
  default_dns_zone_quota: 20
  default_ebw_quota: 4000
  default_eip_quota: 10
  default_eport_quota: 200
  default_globalvpc_quota: 10
  default_group_quota: 50
  default_host_quota: 500
  default_image_cache_dir: image_cache
  default_instance_snapshot_quota: 10
  default_ip_allocation_direction: stepdown
  default_isolated_device_quota: 200
  default_keypair_quota: 50
  default_loadbalancer_quota: 10
  default_max_manual_snapshot_count: 5
  default_max_snapshot_count: 9
  default_memory_overcommit_bound: 1
  default_memory_quota: 204800
  default_mongodb_quota: 10
  default_mtu: 1500
  default_network_gateway_address_esxi: 1
  default_object_cnt_quota: 5000
  default_object_gb_quota: 500
  default_port_quota: 200
  default_process_timeout_seconds: 60
  default_rds_quota: 10
  default_secgroup_quota: 50
  default_security_group_id: default
  default_server_quota: 50
  default_snapshot_quota: 10
  default_storage_overcommit_bound: 1
  default_storage_quota: 12288000
  default_sync_interval_seconds: 900
  default_vpc_external_access_mode: eip-distgw
  default_vpc_quota: 500
  delete_disks_expired_release: false
  delete_eip_expired_release: false
  delete_snapshot_expired_release: false
  disconnected_cloud_account_retry_probe_interval_hours: 2
  dns_domain: cloud.hwecc.com.cn
  dns_server: 172.16.1.200
  enable_auto_merge_security_group: false
  enable_auto_rename_project: false
  enable_auto_split_security_group: true
  enable_auto_switch_server_sku: false
  enable_esxi_swap: false
  enable_host_health_check: true
  enable_monitor_agent: false
  enable_pending_delete: true
  enable_pre_allocate_ip_addr: false
  enable_sync_name: true
  enable_sync_purge: true
  enable_tls_migration: false
  expired_prepaid_max_clean_batch_size: 50
  force_use_origin_vnc: true
  global_mac_prefix: "00:22"
  guest_template_check_interval: 12
  historical_unique_name: false
  host_health_timeout: 60
  host_offline_detection_interval: 30
  host_offline_max_seconds: 180
  image_cache_storage_policy: least_used
  keep_deleted_snapshot_days: 30
  keep_tag_localization: false
  kvm_monitor_agent_use_metadata_service: true
  loadbalancer_pending_delete_check_interval: 3600
  local_data_disk_max_size_gb: 40960
  local_data_disk_min_size_gb: 10
  local_sys_disk_max_size_gb: 2048
  local_sys_disk_min_size_gb: 30
  lock_storage_from_cachedimage: false
  log_timestamp_format: "2006-01-02 15:04:05"
  log_with_time_zone: UTC
  managed_host_sync_status_interval_seconds: 300
  max_cloud_account_error_count: 5
  max_data_disk_count: 12
  max_managed_nic_count: 1
  max_normal_nic_count: 8
  metrics_retention_days: 30
  min_data_disk_count: 0
  min_nic_count: 1
  minimal_ip_addr_reused_interval_seconds: 30
  monitor_endpoint_type: public
  network_always_manual_config: false
  no_check_os_type_for_cached_image: false
  ovn_underlay_mtu: 1500
  pending_delete_check_seconds: 3600
  pending_delete_expire_seconds: 259200
  pending_delete_max_clean_batch_size: 50
  policy_worker_count: 1
  prepaid_auto_renew: true
  prepaid_auto_renew_hours: 3
  prepaid_delete_expire_check: false
  prepaid_expire_check: false
  prepaid_expire_check_seconds: 600
  prohibit_refreshing_cloud_image: false
  query_offset_optimization: false
  rbac_debug: false
  rbac_policy_refresh_interval_seconds: 30
  reconcile_guest_backup_interval_seconds: 30
  repeat_weekdays_limit: 7
  request_worker_count: 8
  resource_expired_notify_days:
  - 1
  - 3
  - 30
  retention_days_limit: 49
  save_cloud_image_to_glance: true
  server_sku_sync_interval_minutes: 60
  server_status_sync_interval_minutes: 5
  set_kvm_server_as_daemon_on_create: true
  sku_batch_sync: 5
  sku_max_cpu_count: 256
  sku_max_mem_size: 1024
  snapshot_create_disk_protocol: fuse
  sync_ext_disk_snapshot_interval_minutes: 20
  sync_skus_day: 1
  sync_skus_hour: 3
  sync_storage_capacity_used_interval_minutes: 20
  system_admin_quota_check: false
  task_worker_count: 4
  tenant_cache_expire_seconds: 900
  time_points_limit: 1
  timer_interval: 60

chenjacken (Author) commented

Is there any way to fix this?
@zexi
