Monitoring

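Notes on monitoring a Kubernetes (K8s) cluster: metrics-server, Rancher, audit logging, Prometheus, Grafana, and Prometheus exporters.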

metrics-server

Clone metrics-server.

git clone https://github.com/kubernetes-incubator/metrics-server.git
cd metrics-server

Edit resource-reader.yaml.

nano deploy/1.8+/resource-reader.yaml

Edit the resources section as follows:

...
resources:
  - pods
  - nodes
  - namespaces
  - nodes/stats
...

Edit metrics-server-deployment.yaml.

nano deploy/1.8+/metrics-server-deployment.yaml

Edit as follows:

...
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.3
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        imagePullPolicy: Always
...

Deploy it.

kubectl apply -f deploy/1.8+/

Wait a few minutes and run:

kubectl top node
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" |jq
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/YOUR-NAMESPACE/pods" |jq
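
Because the metrics API can take a few minutes to start answering, a small polling helper can save re-running the check by hand. The sketch below assumes a POSIX shell; retry is a hypothetical helper name introduced here, not part of kubectl.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, to poll every 10 seconds for up to 5 minutes:

retry 30 10 kubectl top node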

References

https://medium.com/@cagri.ersen/kubernetes-metrics-server-installation-d93380de008

https://github.com/kubernetes-incubator/metrics-server/issues/247

http://d0o0bz.cn/2018/12/deploying-metrics-server-for-kubernetes/

Rancher

docker run \
  -tid \
  --name=rancher \
  --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  rancher/rancher:latest

Add a cluster, then run the manifest it generates on your cluster.

Also check: https://github.com/rancher/fleet

Audit

SSH to your master node.

Create a policy file:

mkdir /etc/kubernetes/policies
nano /etc/kubernetes/policies/audit-policy.yaml

Paste:

# Log all requests at the Metadata level.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata

Edit K8s API server config file:

nano /etc/kubernetes/manifests/kube-apiserver.yaml

Add:

...
spec:
  containers:
  - command:
    - kube-apiserver
...
    - --audit-policy-file=/etc/kubernetes/policies/audit-policy.yaml
    - --audit-log-path=/var/log/apiserver/audit.log
    - --audit-log-format=json
...
    volumeMounts:
...
    - mountPath: /etc/kubernetes/policies
      name: policies
      readOnly: true
...
  volumes:
...
  - hostPath:
      path: /etc/kubernetes/policies
      type: DirectoryOrCreate
    name: policies

Restart kubelet:

systemctl restart kubelet

If the changes did not take effect, stop the API server Docker container; the kubelet restarts it automatically. Replace k8smaster in the commands below with your master node's hostname:

docker stop $(docker ps | grep "k8s_kube-apiserver_kube-apiserver-k8smaster_kube-system" | awk '{print $1}')

Tail the log file:

docker exec -it $(docker ps |grep "k8s_kube-apiserver_kube-apiserver-k8smaster_kube-system" | awk '{print $1}') tail -f /var/log/apiserver/audit.log

References

https://www.outcoldsolutions.com/docs/monitoring-kubernetes/v4/audit/

Prometheus

Create namespace

kubectl create namespace monitoring

Create Prometheus config

nano prometheus.yml

Paste:

global:
  scrape_interval:     15s
  external_labels:
    monitor: 'codelab-monitor'
scrape_configs:

  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

Create a ConfigMap from the config file:

kubectl -n monitoring create configmap cm-prometheus --from-file prometheus.yml

If you need to update the ConfigMap...

Edit the file:

nano prometheus.yml

Update the ConfigMap:

kubectl -n monitoring \
  create configmap cm-prometheus \
  --from-file=prometheus.yml \
  -o yaml --dry-run | kubectl apply -f -

Now we need to roll out the new ConfigMap. At the time of this writing (2019-02-15), this subject seems to be a little tricky. Some options are listed below:

Roll out ConfigMap: option 1 - scale deployment

This is the only way that will "always" work, although there will be a few seconds of downtime:

kubectl -n monitoring scale deployment/prometheus --replicas=0
kubectl -n monitoring scale deployment/prometheus --replicas=1

Roll out ConfigMap: option 2 - patch the deployment

kubectl -n monitoring \
  patch deployment prometheus \
  -p '{"spec":{"template":{"metadata":{"labels":{"date":"2019-02-15"}}}}}'
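
Note that option 2 only triggers a rollout when the label value actually changes, so patching twice with the same date does nothing the second time. A minimal sketch that generates the patch with a timestamp instead, assuming a POSIX shell (rollout_patch is a hypothetical helper name; the label key date is kept from the command above):

```shell
#!/bin/sh
# Print a patch that sets the pod template label "date" to a timestamp.
# An optional first argument overrides the timestamp (useful for testing).
# The format avoids ':' because label values only allow
# alphanumerics, '-', '_' and '.'.
rollout_patch() {
  stamp=${1:-$(date -u +%Y-%m-%d-%H%M%S)}
  printf '{"spec":{"template":{"metadata":{"labels":{"date":"%s"}}}}}\n' "$stamp"
}
```

Usage:

kubectl -n monitoring patch deployment prometheus -p "$(rollout_patch)"

On kubectl 1.15 or newer, kubectl -n monitoring rollout restart deployment/prometheus should achieve the same effect without a hand-written patch.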

Roll out ConfigMap: option 3 - create a new ConfigMap

Create a new ConfigMap:

kubectl -n monitoring \
  create configmap cm-prometheus-new \
  --from-file=prometheus.yml \
  -o yaml --dry-run | kubectl apply -f -

Edit the deployment:

export EDITOR=nano
kubectl -n monitoring edit deployments prometheus

Edit volumes.configMap.name and use cm-prometheus-new. The change will force K8s to create new pods with the new config.

If, for any reason, you deployed Prometheus with hostNetwork: true, options 2 and 3 will return this error:

0/2 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) didn't match node selector.

In this case, use option 1.

If you need more info regarding rolling out ConfigMaps, please refer to: https://stackoverflow.com/questions/37317003/restart-pods-when-configmap-updates-in-kubernetes

https://github.com/kubernetes/kubernetes/issues/22368

Deploy Prometheus

SSH to the node which will host Prometheus and create a directory to persist its data:

mkdir -p /storage/storage-001/mnt-prometheus
chown -R nobody:nogroup /storage/storage-001/mnt-prometheus

Deploy Prometheus:

kubectl create -f - <<EOF

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        runAsUser: 65534
        fsGroup: 65534
      containers:
      - name: prometheus
        image: prom/prometheus:latest
                   
        ports:
        - containerPort: 9090
        
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --web.console.libraries=/usr/share/prometheus/console_libraries
        - --web.console.templates=/usr/share/prometheus/consoles
        - --storage.tsdb.retention.time=90d

        volumeMounts:
          - name: config-volume
            mountPath: /etc/prometheus/prometheus.yml
            subPath: prometheus.yml
              
          - name: mnt-prometheus
            mountPath: /prometheus

      volumes:
        - name: config-volume
          configMap:
           name: cm-prometheus
           
        - name: mnt-prometheus
          hostPath:
            path: /storage/storage-001/mnt-prometheus
            
      nodeSelector:
        kubernetes.io/hostname: k8snode

EOF

Expose Prometheus

kubectl create -f - <<EOF
        
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: srv-prometheus
  namespace: monitoring
spec:
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30909
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort

EOF

Test the deployment

On your workstation access http://YOUR.CLUSTER.IP:30909

Alternatively you can port forward:

export NAMESPACE=monitoring
kubectl port-forward \
  -n $NAMESPACE \
  $(kubectl -n $NAMESPACE get pods |grep "prometheus-" | awk '{print $1}') \
  9090

Then access http://localhost:9090

References

https://sysdig.com/blog/kubernetes-monitoring-prometheus/

https://sysdig.com/blog/kubernetes-monitoring-with-prometheus-alertmanager-grafana-pushgateway-part-2/

https://sysdig.com/blog/kubernetes-monitoring-prometheus-operator-part3/

Manifest example

https://gist.github.com/philips/7ddeeb2fdab2ff4e4f8a035fc567f3d0

Grafana

Create namespace (skip if it already exists)

kubectl create namespace monitoring

Create Grafana config

nano grafana.ini

Paste:

# ConfigMap
##################### Grafana Configuration Example #####################
#
# Everything has defaults so you only need to uncomment things you want to
# change

# possible values : production, development
;app_mode = production

# instance name, defaults to HOSTNAME environment variable value or hostname if HOSTNAME var is empty
;instance_name = ${HOSTNAME}

#################################### Paths ####################################
[paths]
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
;data = /var/lib/grafana

# Temporary files in `data` directory older than given duration will be removed
;temp_data_lifetime = 24h

# Directory where grafana can store logs
;logs = /var/log/grafana

# Directory where grafana will automatically scan and look for plugins
;plugins = /var/lib/grafana/plugins

# folder that contains provisioning config files that grafana will apply on startup and while running.
;provisioning = conf/provisioning

#################################### Server ####################################
[server]
# Protocol (http, https, socket)
;protocol = http

# The ip address to bind to, empty will bind to all interfaces
;http_addr =

# The http port  to use
;http_port = 3000

# The public facing domain name used to access grafana from a browser
;domain = localhost

# Redirect to correct domain if host header does not match domain
# Prevents DNS rebinding attacks
;enforce_domain = false

# The full public facing url you use in browser, used for redirects and emails
# If you use reverse proxy and sub path specify full url (with sub path)
;root_url = http://localhost:3000

# Log web requests
;router_logging = false

# the path relative working path
;static_root_path = public

# enable gzip
;enable_gzip = false

# https certs & key file
;cert_file =
;cert_key =

# Unix socket path
;socket =

#################################### Database ####################################
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url properties.

# Either "mysql", "postgres" or "sqlite3", it's your choice
;type = sqlite3
;host = 127.0.0.1:3306
;name = grafana
;user = root
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
;password =

# Use either URL or the previous fields to configure the database
# Example: mysql://user:secret@host:port/database
;url =

# For "postgres" only, either "disable", "require" or "verify-full"
;ssl_mode = disable

# For "sqlite3" only, path relative to data_path setting
;path = grafana.db

# Max idle conn setting default is 2
;max_idle_conn = 2

# Max conn setting default is 0 (mean not set)
;max_open_conn =

# Connection Max Lifetime default is 14400 (means 14400 seconds or 4 hours)
;conn_max_lifetime = 14400

# Set to true to log the sql calls and execution times.
log_queries =

#################################### Session ####################################
[session]
# Either "memory", "file", "redis", "mysql", "postgres", default is "file"
;provider = file

# Provider config options
# memory: not have any config yet
# file: session dir path, is relative to grafana data_path
# redis: config like redis server e.g. `addr=127.0.0.1:6379,pool_size=100,db=grafana`
# mysql: go-sql-driver/mysql dsn config string, e.g. `user:password@tcp(127.0.0.1:3306)/database_name`
# postgres: user=a password=b host=localhost port=5432 dbname=c sslmode=disable
;provider_config = sessions

# Session cookie name
;cookie_name = grafana_sess

# If you use session in https only, default is false
;cookie_secure = false

# Session life time, default is 86400
;session_life_time = 86400

#################################### Data proxy ###########################
[dataproxy]

# This enables data proxy logging, default is false
;logging = false

#################################### Analytics ####################################
[analytics]
# Server reporting, sends usage counters to stats.grafana.org every 24 hours.
# No ip addresses are being tracked, only simple counters to track
# running instances, dashboard and error counts. It is very helpful to us.
# Change this option to false to disable reporting.
;reporting_enabled = true

# Set to false to disable all checks to https://grafana.net
# for new versions (grafana itself and plugins), check is used
# in some UI views to notify that grafana or plugin update exists
# This option does not cause any auto updates, nor send any information
# only a GET request to http://grafana.com to get latest versions
;check_for_updates = true

# Google Analytics universal tracking code, only enabled if you specify an id here
;google_analytics_ua_id =

#################################### Security ####################################
[security]
# default admin user, created on startup
;admin_user = admin

# default admin password, can be changed before first start of grafana,  or in profile settings
;admin_password = admin

# used for signing
;secret_key = SW2YcwTIb9zpOOhoPsMm

# Auto-login remember days
;login_remember_days = 7
;cookie_username = grafana_user
;cookie_remember_name = grafana_remember

# disable gravatar profile images
;disable_gravatar = false

# data source proxy whitelist (ip_or_domain:port separated by spaces)
;data_source_proxy_whitelist =

# disable protection against brute force login attempts
;disable_brute_force_login_protection = false

#################################### Snapshots ###########################
[snapshots]
# snapshot sharing options
;external_enabled = true
;external_snapshot_url = https://snapshots-origin.raintank.io
;external_snapshot_name = Publish to snapshot.raintank.io

# remove expired snapshot
;snapshot_remove_expired = true

#################################### Dashboards History ##################
[dashboards]
# Number dashboard versions to keep (per dashboard). Default: 20, Minimum: 1
;versions_to_keep = 20

#################################### Users ###############################
[users]
# disable user signup / registration
;allow_sign_up = true

# Allow non admin users to create organizations
;allow_org_create = true

# Set to true to automatically assign new users to the default organization (id 1)
;auto_assign_org = true

# Default role new users will be automatically assigned (if disabled above is set to true)
;auto_assign_org_role = Viewer

# Background text for the user field on the login page
;login_hint = email or username

# Default UI theme ("dark" or "light")
;default_theme = dark

# External user management, these options affect the organization users view
;external_manage_link_url =
;external_manage_link_name =
;external_manage_info =

# Viewers can edit/inspect dashboard settings in the browser. But not save the dashboard.
;viewers_can_edit = false

[auth]
# Set to true to disable (hide) the login form, useful if you use OAuth, defaults to false
;disable_login_form = false

# Set to true to disable the signout link in the side menu. useful if you use auth.proxy, defaults to false
;disable_signout_menu = false

# URL to redirect the user to after sign out
;signout_redirect_url =

# Set to true to attempt login with OAuth automatically, skipping the login screen.
# This setting is ignored if multiple OAuth providers are configured.
;oauth_auto_login = false

#################################### Anonymous Auth ##########################
[auth.anonymous]
# enable anonymous access
;enabled = false

# specify organization name that should be used for unauthenticated users
;org_name = Main Org.

# specify role for unauthenticated users
;org_role = Viewer

#################################### Github Auth ##########################
[auth.github]
;enabled = false
;allow_sign_up = true
;client_id = some_id
;client_secret = some_secret
;scopes = user:email,read:org
;auth_url = https://github.com/login/oauth/authorize
;token_url = https://github.com/login/oauth/access_token
;api_url = https://api.github.com/user
;team_ids =
;allowed_organizations =

#################################### Google Auth ##########################
[auth.google]
;enabled = false
;allow_sign_up = true
;client_id = some_client_id
;client_secret = some_client_secret
;scopes = https://www.googleapis.com/auth/userinfo.profile https://www.googleapis.com/auth/userinfo.email
;auth_url = https://accounts.google.com/o/oauth2/auth
;token_url = https://accounts.google.com/o/oauth2/token
;api_url = https://www.googleapis.com/oauth2/v1/userinfo
;allowed_domains =

#################################### Generic OAuth ##########################
[auth.generic_oauth]
;enabled = false
;name = OAuth
;allow_sign_up = true
;client_id = some_id
;client_secret = some_secret
;scopes = user:email,read:org
;auth_url = https://foo.bar/login/oauth/authorize
;token_url = https://foo.bar/login/oauth/access_token
;api_url = https://foo.bar/user
;team_ids =
;allowed_organizations =
;tls_skip_verify_insecure = false
;tls_client_cert =
;tls_client_key =
;tls_client_ca =

#################################### Grafana.com Auth ####################
[auth.grafana_com]
;enabled = false
;allow_sign_up = true
;client_id = some_id
;client_secret = some_secret
;scopes = user:email
;allowed_organizations =

#################################### Auth Proxy ##########################
[auth.proxy]
;enabled = false
;header_name = X-WEBAUTH-USER
;header_property = username
;auto_sign_up = true
;ldap_sync_ttl = 60
;whitelist = 192.168.1.1, 192.168.2.1
;headers = Email:X-User-Email, Name:X-User-Name

#################################### Basic Auth ##########################
[auth.basic]
;enabled = true

#################################### Auth LDAP ##########################
[auth.ldap]
;enabled = false
;config_file = /etc/grafana/ldap.toml
;allow_sign_up = true

#################################### SMTP / Emailing ##########################
[smtp]
;enabled = false
;host = localhost:25
;user =
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
;password =
;cert_file =
;key_file =
;skip_verify = false
;from_address = admin@grafana.localhost
;from_name = Grafana
# EHLO identity in SMTP dialog (defaults to instance_name)
;ehlo_identity = dashboard.example.com

[emails]
;welcome_email_on_sign_up = false

#################################### Logging ##########################
[log]
# Either "console", "file", "syslog". Default is console and  file
# Use space to separate multiple modes, e.g. "console file"
;mode = console file

# Either "debug", "info", "warn", "error", "critical", default is "info"
;level = info

# optional settings to set different levels for specific loggers. Ex filters = sqlstore:debug
;filters =

# For "console" mode only
[log.console]
;level =

# log line format, valid options are text, console and json
;format = console

# For "file" mode only
[log.file]
;level =

# log line format, valid options are text, console and json
;format = text

# This enables automated log rotate(switch of following options), default is true
;log_rotate = true

# Max line number of single file, default is 1000000
;max_lines = 1000000

# Max size shift of single file, default is 28 means 1 << 28, 256MB
;max_size_shift = 28

# Segment log daily, default is true
;daily_rotate = true

# Expired days of log file(delete after max days), default is 7
;max_days = 7

[log.syslog]
;level =

# log line format, valid options are text, console and json
;format = text

# Syslog network type and address. This can be udp, tcp, or unix. If left blank, the default unix endpoints will be used.
;network =
;address =

# Syslog facility. user, daemon and local0 through local7 are valid.
;facility =

# Syslog tag. By default, the process' argv[0] is used.
;tag =

#################################### Alerting ############################
[alerting]
# Disable alerting engine & UI features
;enabled = true
# Makes it possible to turn off alert rule execution but alerting UI is visible
;execute_alerts = true

# Default setting for new alert rules. Defaults to categorize error and timeouts as alerting. (alerting, keep_state)
;error_or_timeout = alerting

# Default setting for how Grafana handles nodata or null values in alerting. (alerting, no_data, keep_state, ok)
;nodata_or_nullvalues = no_data

# Alert notifications can include images, but rendering many images at the same time can overload the server
# This limit will protect the server from render overloading and make sure notifications are sent out quickly
;concurrent_render_limit = 5

#################################### Explore #############################
[explore]
# Enable the Explore section
;enabled = false

#################################### Internal Grafana Metrics ##########################
# Metrics available at HTTP API Url /metrics
[metrics]
# Disable / Enable internal metrics
;enabled           = true

# Publish interval
;interval_seconds  = 10

# Send internal metrics to Graphite
[metrics.graphite]
# Enable by setting the address setting (ex localhost:2003)
;address =
;prefix = prod.grafana.%(instance_name)s.

#################################### Distributed tracing ############
[tracing.jaeger]
# Enable by setting the address sending traces to jaeger (ex localhost:6831)
;address = localhost:6831
# Tag that will always be included in when creating new spans. ex (tag1:value1,tag2:value2)
;always_included_tag = tag1:value1
# Type specifies the type of the sampler: const, probabilistic, rateLimiting, or remote
;sampler_type = const
# jaeger samplerconfig param
# for "const" sampler, 0 or 1 for always false/true respectively
# for "probabilistic" sampler, a probability between 0 and 1
# for "rateLimiting" sampler, the number of spans per second
# for "remote" sampler, param is the same as for "probabilistic"
# and indicates the initial sampling rate before the actual one
# is received from the mothership
;sampler_param = 1

#################################### Grafana.com integration  ##########################
# Url used to import dashboards directly from Grafana.com
[grafana_com]
;url = https://grafana.com

#################################### External image storage ##########################
[external_image_storage]
# Used for uploading images to public servers so they can be included in slack/email messages.
# you can choose between (s3, webdav, gcs, azure_blob, local)
;provider =

[external_image_storage.s3]
;bucket =
;region =
;path =
;access_key =
;secret_key =

[external_image_storage.webdav]
;url =
;public_url =
;username =
;password =

[external_image_storage.gcs]
;key_file =
;bucket =
;path =

[external_image_storage.azure_blob]
;account_name =
;account_key =
;container_name =

[external_image_storage.local]
# does not require any configuration

[rendering]
# Options to configure external image rendering server like https://github.com/grafana/grafana-image-renderer
;server_url =
;callback_url =

[enterprise]
# Path to a valid Grafana Enterprise license.jwt file
;license_path =

Create a ConfigMap from the config file:

kubectl -n monitoring create configmap cm-grafana --from-file grafana.ini

Create Grafana secrets

Generate base64 strings:

# This will be the admin-username. Copy the output.
echo -n 'admin' | base64

# This will be the admin-password. Copy the output.
echo -n 'PUT-YOUR-PASSWORD-HERE' | base64

Create Secret:

kubectl create -f - <<EOF

apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: monitoring
type: Opaque
data:
  admin-username: PASTE admin-username base64 HERE
  admin-password: PASTE admin-password base64 HERE
  
EOF

To retrieve admin username and password, run:

kubectl -n monitoring \
  get secret grafana \
  -o jsonpath="{.data.admin-username}" \
  | base64 --decode ; echo

kubectl -n monitoring \
  get secret grafana \
  -o jsonpath="{.data.admin-password}" \
  | base64 --decode ; echo
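
The encode-and-paste steps above can also be combined into one command. The following is only a sketch, assuming a POSIX shell with the base64 utility available; render_grafana_secret is a hypothetical helper name, while the Secret name, namespace, and keys are the ones used above.

```shell
#!/bin/sh
# Print a Secret manifest with the given credentials base64-encoded.
# Arguments: $1 = admin username, $2 = admin password.
render_grafana_secret() {
  # tr strips the line wrapping some base64 implementations add to long output
  user_b64=$(printf '%s' "$1" | base64 | tr -d '\n')
  pass_b64=$(printf '%s' "$2" | base64 | tr -d '\n')
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: monitoring
type: Opaque
data:
  admin-username: $user_b64
  admin-password: $pass_b64
EOF
}
```

Usage:

render_grafana_secret admin 'PUT-YOUR-PASSWORD-HERE' | kubectl apply -f -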

Deploy Grafana

SSH to the node which will host Grafana and create a directory to persist its data:

mkdir -p /storage/storage-001/mnt-grafana
chown -R nobody:nogroup /storage/storage-001/mnt-grafana

Deploy Grafana:

kubectl create -f - <<EOF

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:
        runAsUser: 65534 #nobody
        fsGroup: 65534 #nogroup
      containers:
      - name: grafana
        image: grafana/grafana
        
        ports:
        - containerPort: 3000
        
        env:
          - name: GF_AUTH_BASIC_ENABLED
            value: "true"
            
          - name: GF_SECURITY_ADMIN_USER
            #value: "admin"
            valueFrom:
              secretKeyRef:
                name: grafana
                key: admin-username
            
          - name: GF_SECURITY_ADMIN_PASSWORD
            #value: "PLAIN-PWD"
            valueFrom:
              secretKeyRef:
                name: grafana
                key: admin-password
            
          #- name: GF_AUTH_ANONYMOUS_ENABLED
          #  value: "false"
          
          # If you want to allow anonymous admin access, use the following
          # config instead
          #- name: GF_AUTH_BASIC_ENABLED
          #  value: "false"
          #- name: GF_AUTH_ANONYMOUS_ENABLED
          #  value: "true"
          #- name: GF_AUTH_ANONYMOUS_ORG_ROLE
          #  value: Admin
          
        volumeMounts:
          - name: config-volume
            mountPath: /etc/grafana/grafana.ini
            subPath: grafana.ini
            
          - name: mnt-grafana
            mountPath: /var/lib/grafana
            
      volumes:
        - name: config-volume
          configMap:
           name: cm-grafana
           
        - name: mnt-grafana
          hostPath:
            path: /storage/storage-001/mnt-grafana

      nodeSelector:
        kubernetes.io/hostname: k8snode

EOF

Expose Grafana

kubectl create -f - <<EOF
        
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: grafana
  name: srv-grafana
  namespace: monitoring
spec:
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30000
    port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
  sessionAffinity: None
  type: NodePort

EOF

Test the deployment

On your workstation access http://YOUR.CLUSTER.IP:30000

Alternatively you can port forward:

export NAMESPACE=monitoring
kubectl port-forward \
  -n $NAMESPACE \
  $(kubectl -n $NAMESPACE get pods |grep "grafana-" | awk '{print $1}') \
  3000

Then access http://localhost:3000

Dashboards

https://grafana.com/dashboards/2115

Prometheus exporters

node-exporter

Create a DaemonSet to ensure all nodes have node-exporter:

kubectl create -f - <<EOF

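apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    name: node-exporter
spec:
  # apps/v1 requires an explicit selector matching the pod template labels
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter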
      annotations:
         prometheus.io/scrape: "true"
         prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            requests:
              cpu: 0.15
          securityContext:
            privileged: true
          image: prom/node-exporter:latest
          args:
            - --path.procfs
            - /host/proc
            - --path.sysfs
            - /host/sys
            - --collector.filesystem.ignored-mount-points
            - '^/(sys|proc|dev|host|etc)($|/)'
          name: node-exporter
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
    
EOF

Add node-exporter scraper to Prometheus

Edit Prometheus config file:

nano prometheus.yml

Add the scraper:

  - job_name: 'node_exporter_test'
    static_configs:
    - targets: ['YOUR-NODE-IP:9100']
    #relabel_configs:
    #  - source_labels: [__address__]
    #    target_label: instance
    #    replacement: "NEW-LABEL"
    #relabel_configs:
    #  - source_labels: [__address__]
    #    target_label: __address__
    #    replacement: k8snode:9100
    #metric_relabel_configs:
    #  - source_labels: ["__name__"]
    #    target_label: "job"
    #    replacement: "job"
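
With more than a couple of nodes, typing each target by hand gets tedious. As a sketch only, assuming a POSIX shell, the job could be generated from a list of node IPs; node_exporter_job and the job name node_exporter are hypothetical names chosen for this example.

```shell
#!/bin/sh
# Print a Prometheus scrape job with one target per node IP, all on port 9100.
# Arguments: one or more node IPs.
node_exporter_job() {
  printf "  - job_name: 'node_exporter'\n"
  printf "    static_configs:\n"
  printf "    - targets:\n"
  for ip in "$@"; do
    printf "        - '%s:9100'\n" "$ip"
  done
}
```

One way to feed it, assuming your nodes report an InternalIP address, is:

node_exporter_job $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')

Paste the output under scrape_configs in prometheus.yml, then update and roll out the ConfigMap as described earlier.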

Grafana dashboard

ID: 1860

https://grafana.com/dashboards/1860

kube-state-metrics

Deploy dependencies:

git clone https://github.com/kubernetes/kube-state-metrics.git
kubectl apply -f kube-state-metrics/kubernetes/

Expose kube-state-metrics

kubectl create -f - <<EOF
        
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: srv-custom-kube-state-metrics
  namespace: kube-system
spec:
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 32767
    name: metrics
    port: 8080
    protocol: TCP
    targetPort: 8080
  - nodePort: 32766
    name: telemetry
    port: 8081
    protocol: TCP
    targetPort: 8081
  selector:
    k8s-app: kube-state-metrics
  sessionAffinity: None
  type: NodePort

EOF

Add Prometheus scraper

  - job_name: 'kube-state-metrics-metrics'
    static_configs:
    - targets: ['NODE.IP:32767']
    
  - job_name: 'kube-state-metrics-telemetry'
    static_configs:
    - targets: ['NODE.IP:32766']

Update the ConfigMap:

kubectl -n monitoring \
  create configmap cm-prometheus \
  --from-file=prometheus.yml \
  -o yaml --dry-run | kubectl apply -f -

Roll out ConfigMap:

kubectl -n monitoring scale deployment/prometheus --replicas=0
kubectl -n monitoring scale deployment/prometheus --replicas=1

Grafana dashboard

Dashboard ID: 7249

https://grafana.com/dashboards/7249

Dashboard ID: 747

https://grafana.com/dashboards/747

Grafana panels

{
  "columns": [],
  "fontSize": "100%",
  "gridPos": {
    "h": 9,
    "w": 12,
    "x": 0,
    "y": 0
  },
  "id": 2,
  "links": [],
  "pageSize": null,
  "scroll": true,
  "showHeader": true,
  "sort": {
    "col": 2,
    "desc": true
  },
  "styles": [
    {
      "alias": "Time",
      "dateFormat": "YYYY-MM-DD HH:mm:ss",
      "pattern": "Time",
      "type": "date"
    },
    {
      "alias": "",
      "colorMode": null,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "decimals": 2,
      "pattern": "/.*/",
      "thresholds": [],
      "type": "number",
      "unit": "short"
    }
  ],
  "targets": [
    {
      "expr": "sum(kube_pod_container_status_restarts_total{namespace=~\"^$namespace$\",pod=~\"^$pod$\"}) by (pod)",
      "format": "table",
      "intervalFactor": 1,
      "refId": "A"
    }
  ],
  "title": "Pod restart history",
  "transform": "table",
  "type": "table"
}

{
  "columns": [],
  "fontSize": "100%",
  "gridPos": {
    "h": 9,
    "w": 12,
    "x": 0,
    "y": 0
  },
  "id": 2,
  "links": [],
  "pageSize": null,
  "scroll": true,
  "showHeader": true,
  "sort": {
    "col": 5,
    "desc": true
  },
  "styles": [
    {
      "alias": "Time",
      "dateFormat": "YYYY-MM-DD HH:mm:ss",
      "pattern": "Time",
      "type": "date"
    },
    {
      "alias": "",
      "colorMode": null,
      "colors": [
        "rgba(245, 54, 54, 0.9)",
        "rgba(237, 129, 40, 0.89)",
        "rgba(50, 172, 45, 0.97)"
      ],
      "decimals": 2,
      "pattern": "/.*/",
      "thresholds": [],
      "type": "number",
      "unit": "short"
    }
  ],
  "targets": [
    {
      "expr": "sum(kube_pod_container_status_restarts_total{namespace=~\"^$namespace$\",pod=~\"^$pod$\"}) by (namespace, pod, container, job)",
      "format": "table",
      "intervalFactor": 1,
      "refId": "A",
      "legendFormat": "",
      "interval": "",
      "instant": false
    }
  ],
  "title": "Pod restart history",
  "transform": "table",
  "type": "table"
}

nvidia-gpu-exporter

Label your nodes:

kubectl label nodes PUT-YOUR-NODE-HERE hardware-type=NVIDIAGPU

Deploy it:

kubectl create -f - <<EOF

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-exporter
  namespace: monitoring
  labels:
    app: nvidia-gpu-exporter
    component: nvidia-gpu-exporter
spec:
  selector:
    matchLabels:
      app: prometheus
      component: gpu-exporter
  template:
    metadata:
      name: nvidia-gpu-exporter
      labels:
        app: prometheus
        component: gpu-exporter
    spec:
      containers:
      - image: swiftdiaries/gpu_prom_metrics
        name: nvidia-gpu-exporter
        ports:
        - name: prom-gpu-exp
          containerPort: 9445
          hostPort: 9445
      hostNetwork: true
      nodeSelector:
        hardware-type: "NVIDIAGPU"

EOF
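
Once the DaemonSet pods are running, it can help to confirm that the exporter answers before wiring it into Prometheus. The helper below is a minimal sketch: the function name check_gpu_exporter is hypothetical, and the /metrics path is assumed because it is the default path Prometheus scrapes when no other path is configured.

```shell
# Fetch the exporter's metrics from a node and keep only the GPU series.
# Returns non-zero if the endpoint is unreachable or exposes no nvidia_gpu_* lines.
check_gpu_exporter() {
  node_ip="$1"
  # -s: no progress output, -f: treat HTTP errors as failures.
  # Port 9445 is the hostPort from the DaemonSet above; /metrics is an assumption.
  curl -sf "http://${node_ip}:9445/metrics" | grep '^nvidia_gpu_'
}
```

Calling check_gpu_exporter NODE.IP should list series such as nvidia_gpu_num_devices if the exporter is healthy.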

Prometheus scraper:

  - job_name: 'gpu'
    static_configs:
    - targets: ['NODE.IP:9445']

Update the config map following the instructions above.
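
With more than one GPU node, the targets list needs one entry per node. As a rough sketch, the helper below prints the same job with a target for every IP passed to it; the function name gpu_scrape_job is hypothetical and the IPs in the example are placeholders.

```shell
# Print the 'gpu' scrape job with one target per node IP given as an argument.
gpu_scrape_job() {
  targets=""
  for ip in "$@"; do
    # Prefix a ", " separator before every entry except the first.
    targets="${targets:+$targets, }'${ip}:9445'"
  done
  printf "  - job_name: 'gpu'\n"
  printf "    static_configs:\n"
  printf "    - targets: [%s]\n" "$targets"
}

gpu_scrape_job 10.0.0.1 10.0.0.2
```

The node IPs could be collected from the label set earlier, for example with kubectl get nodes -l hardware-type=NVIDIAGPU -o wide, and the printed job pasted into the Prometheus config.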

Grafana dashboard:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 3,
  "iteration": 1553034339729,
  "links": [],
  "panels": [
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "#299c46",
        "rgba(237, 129, 40, 0.89)",
        "#d44a3a"
      ],
      "format": "none",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": false,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 7,
        "w": 2,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "nvidia_gpu_num_devices{instance=\"$node:9445\"}",
          "format": "time_series",
          "intervalFactor": 1,
          "refId": "A"
        }
      ],
      "thresholds": "",
      "timeFrom": null,
      "timeShift": null,
      "title": "GPUs",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "avg"
    },
    {
      "cacheTimeout": null,
      "colorBackground": false,
      "colorValue": false,
      "colors": [
        "#299c46",
        "rgba(237, 129, 40, 0.89)",
        "#d44a3a"
      ],
      "datasource": "prometheus-k8s",
      "format": "none",
      "gauge": {
        "maxValue": 100,
        "minValue": 0,
        "show": true,
        "thresholdLabels": false,
        "thresholdMarkers": true
      },
      "gridPos": {
        "h": 7,
        "w": 5,
        "x": 2,
        "y": 0
      },
      "id": 10,
      "interval": null,
      "links": [],
      "mappingType": 1,
      "mappingTypes": [
        {
          "name": "value to text",
          "value": 1
        },
        {
          "name": "range to text",
          "value": 2
        }
      ],
      "maxDataPoints": 100,
      "nullPointMode": "connected",
      "nullText": null,
      "postfix": "",
      "postfixFontSize": "50%",
      "prefix": "",
      "prefixFontSize": "50%",
      "rangeMaps": [
        {
          "from": "null",
          "text": "N/A",
          "to": "null"
        }
      ],
      "sparkline": {
        "fillColor": "rgba(31, 118, 189, 0.18)",
        "full": false,
        "lineColor": "rgb(31, 120, 193)",
        "show": false
      },
      "tableColumn": "",
      "targets": [
        {
          "expr": "nvidia_gpu_temperature_celsius{instance=\"$node:9445\",minor_number=\"$gpu\"}",
          "format": "time_series",
          "intervalFactor": 1,
          "refId": "A"
        }
      ],
      "thresholds": "33,66,100",
      "title": "Temperature (C)",
      "type": "singlestat",
      "valueFontSize": "80%",
      "valueMaps": [
        {
          "op": "=",
          "text": "N/A",
          "value": "null"
        }
      ],
      "valueName": "current"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 1,
      "gridPos": {
        "h": 7,
        "w": 19,
        "x": 0,
        "y": 7
      },
      "id": 4,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "paceLength": 10,
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "nvidia_gpu_memory_total_bytes{instance=\"$node:9445\",minor_number=\"$gpu\"}/1000000",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "Total",
          "refId": "A"
        },
        {
          "expr": "nvidia_gpu_memory_used_bytes{instance=\"$node:9445\",minor_number=\"$gpu\"}/1000000",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "Used",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Memory (MB)",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 1,
      "gridPos": {
        "h": 7,
        "w": 19,
        "x": 0,
        "y": 14
      },
      "id": 6,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "paceLength": 10,
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "nvidia_gpu_duty_cycle{instance=\"$node:9445\",minor_number=\"$gpu\"}",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "Total",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "GPU Utilisation (%)",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 1,
      "gridPos": {
        "h": 7,
        "w": 19,
        "x": 0,
        "y": 21
      },
      "id": 8,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "paceLength": 10,
      "percentage": false,
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "nvidia_gpu_power_usage_milliwatts{instance=\"$node:9445\",minor_number=\"$gpu\"}/1000",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "Total",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Power Usage (watts)",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 18,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {
          "selected": true,
          "text": "hamilton.cv-prod-nz-air-new-zealand.com",
          "value": "hamilton.cv-prod-nz-air-new-zealand.com"
        },
        "datasource": "prometheus-k8s",
        "definition": "nvidia_gpu_power_usage_milliwatts",
        "hide": 0,
        "includeAll": false,
        "label": "Host:",
        "multi": false,
        "name": "node",
        "options": [],
        "query": "nvidia_gpu_power_usage_milliwatts",
        "refresh": 1,
        "regex": "/.*instance=\"([^\"]*):.*/",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      },
      {
        "allValue": null,
        "current": {
          "tags": [],
          "text": "1",
          "value": "1"
        },
        "datasource": "prometheus-k8s",
        "definition": "nvidia_gpu_temperature_celsius",
        "hide": 0,
        "includeAll": false,
        "label": "GPU:",
        "multi": false,
        "name": "gpu",
        "options": [],
        "query": "nvidia_gpu_temperature_celsius",
        "refresh": 1,
        "regex": "/minor_number=\"(.*?)\"/",
        "skipUrlSync": false,
        "sort": 0,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      }
    ]
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "",
  "title": "GPU",
  "uid": "oaFpztCmk",
  "version": 8
}
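
The two template variables in the dashboard above (node and gpu) are filled by applying a regex to each series returned by the variable query. As an illustration only, the same captures can be reproduced with sed; the sample series string and the function names are hypothetical, and [^"]* stands in for the lazy (.*?) because sed's extended regexes have no lazy quantifier.

```shell
# Hypothetical series string, in the metric{label="value",...} form the regexes match.
series='nvidia_gpu_power_usage_milliwatts{instance="10.0.0.1:9445",minor_number="0"}'

# "node" variable: capture the instance label up to the port separator.
extract_node() {
  printf '%s\n' "$1" | sed -E 's/.*instance="([^"]*):.*/\1/'
}

# "gpu" variable: capture the value of the minor_number label.
extract_gpu() {
  printf '%s\n' "$1" | sed -E 's/.*minor_number="([^"]*)".*/\1/'
}

extract_node "$series"   # prints 10.0.0.1
extract_gpu "$series"    # prints 0
```

This is why the panel queries rebuild the instance label as $node:9445: the variable holds only the host part.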

References

https://github.com/mindprince/nvidia_gpu_prometheus_exporter

https://github.com/andreyvelich/nvidia_gpu_prometheus_exporter

prometheus-operator

These instructions do not work properly due to Persistent Volume issues. They are kept here only for reference.

Install

Create a StorageClass

Create the manifest:

cat > storage-class.yml <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

EOF

Deploy it:

kubectl create -f storage-class.yml
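
The kubernetes.io/no-provisioner provisioner does not create volumes on demand, so any claim that uses this class stays Pending until a matching PersistentVolume already exists. This may be related to the Persistent Volume issues noted above, though that is an assumption and not something confirmed here. A minimal sketch of one local PersistentVolume follows; the file name local-pv.yml, the volume name prom-local-pv-0 and the path /mnt/disks/prom-0 are all hypothetical, and it assumes a node named minikube, as used elsewhere in this section.

```shell
# Write a manifest for a single local PersistentVolume (all names are illustrative).
cat > local-pv.yml <<EOF
kind: PersistentVolume
apiVersion: v1
metadata:
  name: prom-local-pv-0
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    # Hypothetical directory; it must already exist on the node.
    path: /mnt/disks/prom-0
  # Local volumes must declare which node holds the data.
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["minikube"]
EOF
```

It would be created the same way as the StorageClass, with kubectl create -f local-pv.yml. A PersistentVolume binds to a single claim, so one such volume would be needed for each claim that requests local-storage.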

Install the helm chart

Create the manifest:

cat > custom-values.yaml <<EOF

# Depending on which DNS solution you have installed in your cluster enable the right exporter
coreDns:
  enabled: false

kubeDns:
  enabled: true

alertmanager:
  alertmanagerSpec:
    nodeSelector:
      kubernetes.io/hostname: minikube
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-storage
          resources:
            requests:
              storage: 10Gi

prometheus:
  prometheusSpec:
    nodeSelector:
      kubernetes.io/hostname: minikube
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-storage
          resources:
            requests:
              storage: 10Gi

prometheusOperator:
  nodeSelector:
    kubernetes.io/hostname: minikube

grafana:
  adminPassword: "YourPass123#"
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
      - grafana.test.akomljen.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.test.akomljen.com
  persistence:
    enabled: true
    accessModes: ["ReadWriteOnce"]
    storageClassName: local-storage
    size: 10Gi

EOF

Make sure you have helm and tiller set up.

Deploy prometheus-operator:

helm install \
  --tls \
  --name prom \
  --namespace monitoring \
  -f custom-values.yaml \
  stable/prometheus-operator

If you need to update the custom values, run:

helm upgrade -f custom-values.yaml \
  prom stable/prometheus-operator

Check statuses:

kubectl --namespace monitoring get pods -l "release=prom"

Port forwarding

Prometheus:

kubectl port-forward \
  -n monitoring \
  prometheus-prom-prometheus-operator-prometheus-0 9090

Alert manager:

kubectl port-forward \
  -n monitoring \
  alertmanager-prom-prometheus-operator-alertmanager-0 9093

Grafana:

kubectl port-forward \
  -n monitoring \
  $(kubectl -n monitoring get pods |grep "prom-grafana" | awk '{print $1}') \
  3000:3000

References

https://github.com/helm/charts/tree/master/stable/prometheus-operator

https://github.com/coreos/prometheus-operator

https://akomljen.com/get-kubernetes-cluster-metrics-with-prometheus-in-5-minutes/

https://www.sachsenhofer.io/setup-prometheus-operator-kube-prometheus-kubernetes-cluster/

Uninstall

To uninstall/delete:

helm delete --purge prom
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete namespace monitoring

Custom values

Check:

https://github.com/helm/charts/tree/master/stable/prometheus-operator#configuration

https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md
