
User guide

Prerequisites:

I. Check the availability of MIG Profiles.

Note
Only the MIG profiles currently created on the server are available for use.
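They can be listed with the same command used later in the Troubleshooting section:

$ nvidia-smi -L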

II. Check the resources given to you.


$ kubectl describe quota
Note
If a certain MIG profile is not available or not set up, choose one of the available MIG profiles
or request your administrator to set it up.

III. Check the availability of the particular NVIDIA Docker images you want to
use (the administrator needs to update these)
$ docker images | grep nvcr.io
$ docker images | grep boot

Note
For this bootcamp, use the image iit_bootcamp/pyt:21.08 or iit_bootcamp/pyt:21.09.
This step is performed/updated by the administrator.
If a particular Docker image is not available, request your administrator to pull that image.
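For reference, a minimal sketch of how an administrator might pull a missing image from NGC (the tag shown is only an example taken from later in this guide; substitute the image you actually need):

$ docker pull nvcr.io/nvidia/pytorch:21.06-py3
$ docker images | grep nvcr.io   # confirm the image is now available locally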
STEPS TO CREATE KUBERNETES DEPLOYMENT/JOB:

1. Log in to your user account

2. Check your present directory


$ pwd

3. Check GPU resources / user quota


# Based on your quota, use that MIG profile (like 1g.5gb, 2g.10gb, ...) to launch an instance
$ kubectl get quota

4. Launch a Jupyter notebook


$ launch-jupyter-mig
The following scripts are available, along with their purpose (a short usage example follows this list):

a. launch-job
i. To launch a JOB to run a Python script.
b. launch-jupyter
i. To launch a Jupyter notebook instance for utilising a complete 40 GB GPU
profile.

For a complete GPU profile, no MIG devices are created and the GPU has only one
unique UUID.

c. launch-jupyter-mig
i. To launch a Jupyter notebook instance for utilising a given MIG profile.
d. launch-jupyter-mig-rapids
i. To launch a Jupyter notebook instance for utilising a given MIG profile for
RAPIDS container.
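A minimal usage sketch, assuming the site scripts take no arguments as shown in this guide (prompts and behaviour may differ on your setup):

$ launch-jupyter-mig    # launch a Jupyter instance on a MIG profile
$ kubectl get pods      # the new pod should reach STATUS "Running" before you continue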
5. Get the URL to access the Jupyter notebook
$ cat jupyter-login-<instance-name>.txt

6. Launch JupyterLab in the browser

● Copy and paste the URL of the instance into the browser.


7. Launch the batch job
$ launch-job
Upload your Python file to /workspace/home in the Jupyter instance. Then run:
/workspace/home/<your python file>.py
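A minimal sketch of the batch-job flow (launch-job is a site-specific script; the kubectl commands are standard):

$ launch-job            # submit the batch job
$ kubectl get jobs      # confirm the job object was created
$ kubectl get pods      # the job's pod appears once resources are free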
Note
● Check the log of the job
$ kubectl logs <pod-name>

● Destroy the running instances


$ destroy-jupyter -d <instance-name>
$ destroy-job -d <instance-name>
Troubleshooting

Step 1: Check the list of files


$ cd /home/username
$ ls

Step 2: Check all the Kubernetes resources created

$ kubectl get all


This shows that:
● Two pods are created: mytensorflow and tfw1.
● Three deployments are created: two mytensorflow deployments and tfw1.
● Three services are created: two mytensorflow services and tfw1.
● Three replicasets are created: two mytensorflow replicasets and tfw1.
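To list only a specific resource type instead of everything, the standard kubectl commands can be used:

$ kubectl get pods
$ kubectl get deployments
$ kubectl get services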

Step 3: Check the availability of MIG profiles / 40 GB GPU (without MIG)


$ nvidia-smi -L

The above shows that:


a. The MIG 2g.10gb profile is created for GPU IDs 0 to 5.
b. The complete 40 GB GPU profile is available for GPU IDs 6 and 7.

Step 4: Check the resource quota given to you


$ kubectl get quota

For this particular user, the following quota is given:

MIG/GPU Profile     Profiles Used    Resources Given

NVIDIA 40 GB        0                0
MIG 1g.5gb          1                1
MIG 2g.10gb         1                1
MIG 3g.20gb         0                0
MIG 4g.20gb         0                0
MIG 7g.40gb         0                0

Note
● Based on the resources given to your user and the available GPU/MIG profiles, choose
the appropriate profile when launching your Jupyter instance or jobs.

Step 5: Check the events for failed POD/Deployment.


$ kubectl describe pod <pod_name>

$ kubectl describe deployment <deployment_name>
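In addition, cluster events can be listed directly and sorted by time (standard kubectl; useful when the failing object is not obvious):

$ kubectl get events --sort-by=.metadata.creationTimestamp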


Step 6: Check the logs for your POD/Deployment
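For example (standard kubectl; <pod_name> and <deployment_name> are placeholders):

$ kubectl logs <pod_name>
$ kubectl logs deployment/<deployment_name>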

Step 7: Debugging/Understanding the error/logs for your POD instance.

● If the error in the Events section is: “0/1 nodes are available: 1 Insufficient
nvidia.com/mig-1g.5gb”, see “Error due to insufficient resource availability” below.
● If the error in the Events section is:
○ Error while pulling a new Docker image
■ From the launch-jupyter script:
Error from server (BadRequest): container "test" in pod "test-58994b899c-6mgpt" is
waiting to start: ContainerCreating

■ From the Events section in “$ kubectl describe pod <pod_name>”

In this case the Kubernetes Deployment is trying to pull the image
"nvcr.io/nvidia/pytorch:21.06-py3".

○ Error when the Docker image is incorrect or not present in the Docker/NVIDIA
repository
Since the image “nvcr.io/pytorch:21.07-py3“ is not present in the NGC repository, the
image name is incorrect, and that is why it throws an “ImagePullBackOff” error. For example:
$ launch-jupyter

# Check the summary of all K8s objects.
$ kubectl get all

# Check the Event logs of the failed POD.
$ kubectl describe pod <pod_name>
○ Error due to insufficient resource availability
$ kubectl describe pod <pod_name>

■ It can be because a particular MIG/GPU profile has not been given to the
user.
■ It can also be because new MIG profiles have been created that are not
yet discovered by the K8s cluster.
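A quick way to confirm which MIG resources a node actually advertises to Kubernetes (the exact resource names, such as nvidia.com/mig-1g.5gb, depend on the device-plugin configuration; <node_name> is a placeholder):

$ kubectl get nodes
$ kubectl describe node <node_name> | grep -i mig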

Step 8: Resolving certain bugs/warnings from K8s cluster


● After verifying the availability of your resources as well as the Docker images, check
the number and status of the K8s PODs/Deployments created/running.
● If multiple such K8s resources are created utilizing the same resource or
unavailable resources, destroy them. For destroying, you can use the destroy
scripts as well as the commands below.
$ kubectl delete pod <pod_name>
$ kubectl delete deployment <deployment_name>
$ kubectl delete service <service_name>
# Verify if the POD/Deployment is deleted:
$ kubectl get all
● If the POD/Deployment status is Running but the token is not generated, you can try
this method:

In the service table:

The name of the service is defined, as well as the POD port information:
- PORT(S) 8888:30731
- 8888: default container port
- 30731: POD port (NodePort)
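The NodePort can also be read directly from the service (standard kubectl; <service_name> is a placeholder):

$ kubectl get service <service_name>   # the PORT(S) column shows <port>:<nodePort>/TCP, e.g. 8888:30731/TCP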
$ kubectl logs <pod_name>

From the above:

- Get the information about the POD's token.
- token=xxxxxxxxxxxx

Go to the browser:
1. Go to the link: <server-IP>:30731
Where <server-IP> → the server's IP address
30731 → the POD port (NodePort)

2. Copy the token from the logs and paste it


3. Run the jupyter instance
Step 9: If the above methods don't work:
● Try re-running the launch scripts after destroying the existing instances.
● Debug the error from the message in the Events section of “kubectl describe pod
<pod_name>”.
● Finally, consult your administrator with the error message from the Events section.

DOS and DON’TS

DOS

● Check the available MIG profiles/GPUs on the server.
● Also check the resources given to your user profile.
● Verify whether the Docker image you want to use for your project is present or not.
● If a certain Docker image is not present, ask the admin to pull those images.
● If the required image is not available on NGC, check for the image on Docker Hub and
request your admin with the details of that image.
● Have a basic understanding of Docker images and containers, and Kubernetes PODs;
this will help in debugging issues.

DON’TS

● Launching the scripts directly without verifying the MIG/GPU profiles created on the
server; in that case the script for creating the instance will not generate the link for
connecting to the JupyterLab server.
● If the link is not generated, follow steps 8 and 9 of the Troubleshooting section.

FAQs

1. Where to store the data?


● The data should be stored in the home directory. Even when you log out of
the container, your data will persist if it is stored in the /workspace/home/
directory of the Jupyter instance.

2. How to access one’s data?

● You can access it via the terminal:


$ cd /raid/home/<user-name>
3. When starting with any framework, you can also refer to the tutorials for
reference.

4. When we exit from the ssh session, will the jupyter instance exist or
not?
● When a user exits from an ssh session, it will not break the instance. Keep in mind
that Jupyter should be used for quick prototyping; it is not meant to run large-scale
jobs.
● It's possible that you may not have access all the time. A lot of the time, GPUs are
allocated to other users, and you may not find a Jupyter instance ready. Running a
batch job is more appropriate, as it will be put into the queue and launched as soon
as resources are available.

5. Can we download external libraries in the Jupyter instance, and will they persist?

● You can download external libraries in the terminal using pip install:
$ pip install <library-name>
● They will not persist in the Jupyter instance, so it is recommended to keep a
requirements.txt file: put all the libraries in it and install them with pip3 from the
requirements file each time you create a new instance.
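A minimal sketch, assuming the requirements.txt file is kept in /workspace/home:

$ cd /workspace/home
$ pip3 install -r requirements.txt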

6. Can we create a virtual environment inside the container?


● It's not recommended, as the container already has most of the dependencies inside
it; the environment also will not persist.

7. How to get different versions of TensorFlow and PyTorch?


● You can get them from the NGC site, and you can map the framework version using
the DL_Framework_SupportMatrix .
8. GPU memory allocation for a container is shared or exclusive to the
user?
● GPU memory allocation for a container is exclusive to the user. It is one of the
reasons for recommending batch jobs.

9. Common kubectl commands needed.


● To see the running pods:
$ kubectl get pods

● To check the state of a pod, or when the pod state is not successful:
$ kubectl describe pod <pod-name>

● To get the log of the current job:
$ kubectl logs <job-name>
10. Do we need Docker prior to the installation of Kubernetes?
Yes, we need a container runtime (in this case Docker) prior to the installation of
Kubernetes. There are different types of runtimes; it might not be Docker in other setups.
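A quick sanity check of both (standard commands):

$ docker --version      # confirm the container runtime is installed
$ kubectl version       # confirm kubectl can reach the cluster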

COMMANDS CHEAT SHEET


kubectl get all
    To get an overview of all Kubernetes resources created.

kubectl describe pod <pod_name>
kubectl describe deployment <deployment_name>
    To check the details of Kubernetes PODs or Deployments.

kubectl logs <pod_name/job_name>
    To check the logs of Kubernetes PODs or JOBs.

kubectl get quota
    To check the resources assigned to a particular user account.

kubectl delete pod <pod_name>
kubectl delete deployment <deployment_name>
kubectl delete service <service_name>
    To manually delete Kubernetes PODs/Deployments/Services.
    ** Use these when Kubernetes resources aren’t deleted by the
    “destroy-jupyter/destroy-job” scripts.

nvidia-smi -L
    To check the overview of MIG profiles created.
