Use LXD virtualization to set up isolated multi-user runtime environments on a lab server.
Server monitoring: netdata configuration

    # Install netdata from the stable channel with telemetry disabled
    bash <(curl -Ss https://my-netdata.io/kickstart.sh) --stable-channel --disable-telemetry
    # Build dependencies
    sudo apt-get install zlib1g-dev uuid-dev libuv1-dev liblz4-dev libjudy-dev libssl-dev libelf-dev libmnl-dev gcc make git autoconf autoconf-archive autogen automake pkg-config curl python cmake

    sudo systemctl enable netdata                          # start at boot
    sudo systemctl start netdata                           # likewise: stop, status
    cat /etc/netdata/netdata.conf                          # view the configuration
    sudo sh /usr/libexec/netdata/netdata-uninstaller.sh    # uninstall
    sudo sh /usr/libexec/netdata/netdata-updater.sh        # update
Container management: LXD configuration

Install the container manager lxd, the zfs file system, and the network bridging tool bridge-utils:

    sudo snap install lxd
    sudo apt install zfsutils-linux bridge-utils
Initialize lxd. Prompts with no answer after the colon keep their default:

    sudo lxd init

    Would you like to use LXD clustering? (yes/no) [default=no]:
    Do you want to configure a new storage pool? (yes/no) [default=yes]:
    Name of the new storage pool [default=default]: lxd
    Name of the storage backend to use (btrfs, ceph, dir, lvm, zfs) [default=zfs]:
    Create a new ZFS pool? (yes/no) [default=yes]:
    Would you like to use an existing block device? (yes/no) [default=no]:
    Size in GB of the new loop device (1GB minimum) [default=100GB]: 360GB
    Would you like to connect to a MAAS server? (yes/no) [default=no]:
    Would you like to create a new local network bridge? (yes/no) [default=yes]:
    Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]:
    Name of the existing bridge or host interface:
    Would you like LXD to be available over the network? (yes/no) [default=no]:
    Would you like stale cached images to be updated automatically? (yes/no) [default=yes]
    Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:
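The same choices can also be supplied non-interactively. The sketch below prints a preseed document that is meant to mirror the non-default answers in the transcript (a ZFS pool named lxd on a 360GB loop device, plus a new bridge). The helper name print_preseed is hypothetical, and the bridge name lxdbr0 is an assumption based on the usual LXD default; review the output before applying it.

```shell
# Print an "lxd init" preseed document (sketch, not the author's method).
# print_preseed is a hypothetical helper name.
print_preseed() {
    cat <<'EOF'
config: {}
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: auto
storage_pools:
- name: lxd
  driver: zfs
  config:
    size: 360GB
profiles:
- name: default
  devices:
    eth0:
      name: eth0
      nictype: bridged
      parent: lxdbr0
      type: nic
    root:
      path: /
      pool: lxd
      type: disk
EOF
}

# Review the output first, then feed it to LXD:
#   print_preseed | sudo lxd init --preseed
```

Keeping the preseed in a function makes it easy to inspect or diff before it touches the server.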
Edit the default lxd profile:

    sudo lxc profile edit default

Example configuration (YAML). Note that the character device nvidia-uvm is essential for accessing the GPU driver from inside the container:

    config:
      limits.cpu: "32"
      limits.memory: 64GB
      security.nesting: "true"
      security.privileged: "true"
    description: Default LXD profile
    devices:
      eth0:
        limits.ingress: 200Mbit
        name: eth0
        nictype: bridged
        parent: lxdbr0
        type: nic
      gpu:
        type: gpu
      nvidia-uvm:
        path: /dev/nvidia-uvm
        type: unix-char
      root:
        path: /
        pool: default    # must match the storage pool name chosen during lxd init
        size: 30GB
        type: disk
      shared-folder:
        path: /usr/local/shared-folder
        source: /usr/local/shared-folder
        type: disk
    name: default
    used_by: []
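Part of this profile can also be applied from the command line instead of the editor, which is convenient when rebuilding the server. This is a minimal sketch: run and apply_profile are hypothetical helper names, the values are copied from the YAML example, and the eth0 and root devices are left out on the assumption that lxd init already created them.

```shell
# Echo commands instead of running them when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply resource limits and GPU-related devices to a profile.
# apply_profile is a hypothetical helper; values mirror the YAML example.
apply_profile() {
    local profile=${1:-default}
    run sudo lxc profile set "$profile" limits.cpu 32
    run sudo lxc profile set "$profile" limits.memory 64GB
    run sudo lxc profile set "$profile" security.nesting true
    run sudo lxc profile set "$profile" security.privileged true
    run sudo lxc profile device add "$profile" gpu gpu
    run sudo lxc profile device add "$profile" nvidia-uvm unix-char path=/dev/nvidia-uvm
    run sudo lxc profile device add "$profile" shared-folder disk \
        source=/usr/local/shared-folder path=/usr/local/shared-folder
}

# Preview the commands without touching LXD:
#   DRY_RUN=1 apply_profile default
```

Adding a device that already exists in the profile fails, so this is meant for a fresh profile; `sudo lxc profile show default` displays the result.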
Pull the latest Ubuntu 18.04 system image from the Tsinghua University (TUNA) mirror:

    sudo lxc remote add tuna-images https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
    sudo lxc image list tuna-images:
    sudo lxc launch tuna-images:<hash> <ContainerName>
Install the container monitoring tool lxdui and register it as a system service:

    sudo apt install -y git build-essential libssl-dev python3-venv python3-pip python3-dev zfsutils-linux bridge-utils
    git clone https://github.com/AdaptiveScale/lxdui.git
    cd lxdui
    sudo su                  # install as root so that lxdui lands in /usr/local/bin
    pip3 install .
    su <PreviousUserName>    # switch back to the regular user
Following the tutorial on registering Linux system services and the official lxdui reference unit file, create the file /etc/systemd/system/lxdui.service:

    [Unit]
    Description=Web UI for the native Linux container technology LXD/LXC
    After=network.target snapd.service
    Requires=snapd.service

    [Service]
    Type=simple
    Restart=always
    RestartSec=10
    User=root
    PIDFile=/run/lxdui/lxdui.pid
    ExecStart=/usr/local/bin/lxdui start
    ExecStop=/usr/local/bin/lxdui stop

    [Install]
    WantedBy=multi-user.target
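Start the service. The default username is admin, the default password is admin, and the default port is 15151:

    sudo systemctl daemon-reload
    sudo systemctl enable lxdui
    sudo systemctl start lxdui
    # Allow access to the web UI only from the given subnet
    sudo ufw allow from <SubnetIPAddress>/<LengthOfSubnetMask> to any port 15151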
Configuring the GPU environment in the template container

Enter the container:

    sudo lxc exec <ContainerName> -- su --login <UserName>
Configure passwordless sudo:

    sudo su root
    visudo

    # Add this line in the editor that visudo opens:
    <UserName> ALL=(ALL) NOPASSWD: ALL
    # Back up the apt source list, then replace its contents with the TUNA mirror entries
    sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
    sudo vi /etc/apt/sources.list

    deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic main restricted universe multiverse
    deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
    deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
    deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-security main restricted universe multiverse
Download the GPU driver, the CUDA 10.1 installer, and the cuDNN 7.6.5 archive:
NVIDIA Driver Linux x86_64 Archive
CUDA Archive
cuDNN Archive
Configure CUDA 10.1, following the official documentation.

Check the GPU hardware:

    sudo apt update
    sudo apt install pciutils
    lspci | grep -i nvidia
Check the architecture and the Linux distribution:

    uname -m && cat /etc/*release
Check whether gcc is installed:

    gcc --version
    # If it is missing, install the toolchain
    sudo apt install build-essential manpages-dev
Install the kernel headers that match the running Ubuntu kernel:

    sudo apt-get install linux-headers-$(uname -r)
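Install the GPU driver and CUDA 10.1 from the downloaded installers. The container shares the host's kernel, so the driver is installed without its kernel module (--no-kernel-module); its version should match the driver on the host:

    sudo sh NVIDIA-Linux-x86_64-<DriverVersion>.run --no-kernel-module
    sudo sh cuda_<CUDAVersion>_linux.run --no-drm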
Add the environment variables to ~/.bashrc:

    vi ~/.bashrc

    # Append these lines; export makes them visible to child processes
    export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

    source ~/.bashrc
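Plain prepending adds a duplicate entry every time ~/.bashrc is sourced again in the same shell. A minimal sketch of an idempotent alternative follows; path_prepend is a hypothetical helper name, not something provided by CUDA or bash.

```shell
# Prepend DIR to the colon-separated list stored in the variable named VAR,
# unless DIR is already one of its entries. path_prepend is a hypothetical helper.
path_prepend() {
    local var_name=$1 dir=$2 current
    eval "current=\${$var_name:-}"
    case ":$current:" in
        *":$dir:"*) ;;    # already present, nothing to do
        *) eval "$var_name=\$dir\${current:+:\$current}" ;;
    esac
    export "$var_name"
}

# Prepend in reverse order so that cuda-10.1/bin ends up first.
path_prepend PATH /usr/local/cuda-10.1/NsightCompute-2019.1
path_prepend PATH /usr/local/cuda-10.1/bin
path_prepend LD_LIBRARY_PATH /usr/local/cuda-10.1/lib64
```

Wrapping the list in colons before matching means an entry is only treated as present when it matches a whole path component, not a substring of another one.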
Finally, check the driver version and the CUDA version:

    cat /proc/driver/nvidia/version
    nvcc -V
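When this check is repeated for every new template, it can help to compare the reported release against the expected one automatically. The sketch below parses the "release X.Y" field that nvcc -V prints; cuda_release and expect_cuda_release are hypothetical helper names.

```shell
# Read nvcc -V output on stdin and print the release number, e.g. 10.1 from
#   Cuda compilation tools, release 10.1, V10.1.243
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Succeed only if the release read from stdin equals the expected one.
expect_cuda_release() {
    local expected=$1 actual
    actual=$(cuda_release)
    if [ "$actual" = "$expected" ]; then
        echo "CUDA release $actual matches"
    else
        echo "expected CUDA $expected, found '${actual:-none}'" >&2
        return 1
    fi
}

# Usage:
#   nvcc -V | expect_cuda_release 10.1
```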
[Optional] Build and run the sample code:

    cd <SampleRootPath>
    cd 1_Utilities/deviceQuery
    make
    ./deviceQuery
[Optional] Enable persistence mode:

    sudo /usr/bin/nvidia-persistenced --verbose
Configure cuDNN 7.6.5, following the official documentation:

    cd <CudnnPath>
    tar -xzvf cudnn-<CudnnVersion>.tgz
    # Copy the header and libraries into the CUDA installation and make them readable
    sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
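To confirm which cuDNN version ended up under /usr/local/cuda, the version macros in the copied header can be read back. This is a sketch that assumes the cuDNN 7.x layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in cudnn.h itself; cudnn_version is a hypothetical helper name.

```shell
# Print MAJOR.MINOR.PATCH from a cuDNN header (default: the path used above).
# Fails if any of the three macros is missing.
cudnn_version() {
    local header=${1:-/usr/local/cuda/include/cudnn.h}
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            if (major == "" || minor == "" || patch == "") exit 1
            print major "." minor "." patch
        }
    ' "$header"
}

# Usage:
#   cudnn_version
```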
Configuring the Python environment in the template container

Download the Python 3.7.9 source code:
Python Source Code Download Page

Unpack it and install the build dependencies:

    tar -xzvf Python-3.7.9.tgz
    sudo apt update
    sudo apt install -y gcc make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev
Set the build options:

    cd Python-3.7.9
    ./configure --prefix=/usr/local/python3.7 --enable-optimizations

Build and install (adjust -j to the number of available cores):

    sudo make -j 32 && sudo make install
Add the environment variables:

    vi ~/.bashrc

    # Append these lines
    export PATH=/usr/local/python3.7/bin${PATH:+:${PATH}}
    export PATH=/home/ubuntu/.local/bin${PATH:+:${PATH}}

    source ~/.bashrc
    sudo chown -R <UserName> ~/.local
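Point pip at the Tsinghua University (TUNA) mirror:

    # Upgrade pip through the mirror, then make the mirror the default index
    python3.7 -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U --user
    python3.7 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

    # Alternative: the Aliyun mirror (this overrides the setting above)
    python3.7 -m pip config set global.index-url https://mirrors.aliyun.com/pypi/simple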
Example: installing packages with pip in user space (do not use sudo unless it is really needed):

    python3.7 -m pip install numpy virtualenv --user --no-cache-dir
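Example: setting up a virtual environment inside a project folder:

    # Create the environment in ./.venv/
    python3.7 -m virtualenv ./.venv/

    # Activate it (same directory as above)
    source ./.venv/bin/activate

    # Run code and install packages inside the environment
    python <PythonFileName>.py
    python -m pip install numpy --no-cache-dir

    # Leave the environment
    deactivate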
Install pytorch and tensorflow:

    python3.7 -m pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html --user --no-cache-dir
    python3.7 -m pip install tensorflow-gpu==2.3.1 --user --no-cache-dir
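Test whether the GPU is usable:

    import torch
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())

    import tensorflow as tf
    # Deprecated in TensorFlow 2.x in favor of tf.config.list_physical_devices('GPU'),
    # but still available in 2.3.1
    print(tf.test.is_gpu_available())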
One-command scripts for adding and deleting users

Export the template container as an image:

    sudo lxc stop <ContainerName>
    sudo lxc publish <ContainerName> --alias <NewImageName> --public
Server-side script for adding a user:

    # Arguments: $1 = user name (also used as the container name), $2 = SSH port
    : "${1:?user name required}" "${2:?SSH port required}"
    echo "User name: " "$1"
    echo "SSH port for this user: " "$2"

    # Launch a container from the published template image and give it time to boot
    sudo lxc launch <ImageName> "$1"
    sleep 10

    # Per-user data directory on the SSD, mounted at /mnt/data inside the container
    if [ ! -d "/mnt/data/${1}" ]; then
        mkdir "/mnt/data/${1}/"
    fi
    sudo lxc config device add "$1" data disk source="/mnt/data/${1}/" path=/mnt/data
    echo "SSD mounted successfully!"

    # Key directories on the server and in the shared folder
    if [ ! -d /home/<UserName>/keys ]; then
        mkdir /home/<UserName>/keys
    fi
    if [ ! -d /usr/local/shared-folder/keys ]; then
        sudo mkdir /usr/local/shared-folder/keys
    fi

    sudo lxc exec "$1" -- sudo apt update
    sudo lxc exec "$1" -- sudo apt install -y openssh-server
    sudo lxc exec "$1" -- sudo service ssh start
    echo "SSH service started inside the container!"

    # Replace sshd_config: key-only login for user ubuntu on the chosen port
    sudo lxc exec "$1" -- sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
    sudo lxc exec "$1" -- sudo chmod a+w /etc/ssh/sshd_config
    sshd_config="Port ${2}
    AddressFamily any
    ListenAddress 0.0.0.0
    ListenAddress ::
    PubkeyAuthentication yes
    AuthorizedKeysFile .ssh/authorized_keys
    PasswordAuthentication no
    Subsystem sftp /usr/lib/openssh/sftp-server
    AllowUsers ubuntu
    UsePAM yes"
    sudo lxc exec "$1" -- sh -c "cat>/etc/ssh/sshd_config<<EOF
    ${sshd_config}
    EOF"
    sudo lxc exec "$1" -- sudo chmod 644 /etc/ssh/sshd_config
    echo "New sshd configuration written!"

    # Generate the user's key pair on the server and publish the public key
    ssh-keygen -b 2048 -t rsa -f "/home/<UserName>/keys/id_rsa_$1" -q -N ""
    echo "New key pair generated!"
    sudo chmod a+r "/home/<UserName>/keys/id_rsa_$1"
    sudo cp "/home/<UserName>/keys/id_rsa_$1.pub" "/usr/local/shared-folder/keys/id_rsa_$1.pub"
    echo "Public key copied to the shared folder!"

    # Authorize the public key for user ubuntu inside the container
    sudo lxc exec "$1" -- sudo mkdir /home/ubuntu/.ssh
    sudo lxc exec "$1" -- sudo touch /home/ubuntu/.ssh/authorized_keys
    sudo lxc exec "$1" -- sudo chmod a+w /home/ubuntu/.ssh/authorized_keys
    sudo lxc exec "$1" -- sh -c "sudo cat /usr/local/shared-folder/keys/id_rsa_$1.pub >> /home/ubuntu/.ssh/authorized_keys"
    sudo lxc exec "$1" -- sudo chown -R ubuntu /home/ubuntu/.ssh
    sudo lxc exec "$1" -- sudo chmod 700 /home/ubuntu/.ssh
    sudo lxc exec "$1" -- sudo chmod 644 /home/ubuntu/.ssh/authorized_keys
    echo "Public key written to the .ssh folder!"

    # Forward the host port to the container's sshd and open the firewall
    sudo lxc exec "$1" -- sudo service ssh restart
    sudo lxc config device add "$1" port-ssh proxy listen=tcp:0.0.0.0:"$2" connect=tcp:127.0.0.1:"$2"
    echo "SSH service and port forwarding configured!"

    sudo ufw allow "$2"/tcp
    echo "Firewall port $2 opened!"
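The script above trusts its two arguments and waits a fixed ten seconds for the container to boot. The sketch below shows one way to harden both points. It is not part of the original script: valid_name, valid_port and wait_for_ipv4 are hypothetical helpers, the name rule assumes LXD's hostname-style naming, the 1024 to 65535 port range is a policy assumption, and LXC is an assumed override variable for the lxc command.

```shell
# Command used to talk to LXD; override it for testing or non-sudo setups.
LXC=${LXC:-sudo lxc}

# Hostname-style name: letters, digits, hyphens; starts with a letter,
# does not end with a hyphen, at most 63 characters.
valid_name() {
    case $1 in
        ''|[!a-zA-Z]*|*-|*[!a-zA-Z0-9-]*) return 1 ;;
    esac
    [ "${#1}" -le 63 ]
}

# Unprivileged TCP port: 1 to 5 digits, between 1024 and 65535.
valid_port() {
    case $1 in
        ''|*[!0-9]*|??????*) return 1 ;;
    esac
    [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

# Poll until the container reports an IPv4 address, up to TRIES seconds.
wait_for_ipv4() {
    local name=$1 tries=${2:-30} addr
    while [ "$tries" -gt 0 ]; do
        addr=$($LXC list "^${name}\$" -c 4 --format csv 2>/dev/null)
        if [ -n "$addr" ]; then
            echo "$addr"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Possible use at the top of the add-user script:
#   valid_name "$1" || { echo "invalid user name: $1" >&2; exit 1; }
#   valid_port "$2" || { echo "invalid port: $2" >&2; exit 1; }
#   ... after lxc launch, instead of sleep 10:
#   wait_for_ipv4 "$1" 60 || { echo "container has no network" >&2; exit 1; }
```

Polling returns as soon as the container has an address and fails loudly when it never gets one, whereas a fixed sleep does neither.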
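Administrator-PC-side script for adding a user:

    # Arguments: $1 = user name, $2 = SSH port of the user's container
    echo "User name: " "$1"
    echo "SSH port for this user: " "$2"

    # Fetch the user's private key from the server and restrict its permissions
    scp -P <PortNumber> <UserName>@<ServerIPAddress>:/home/<UserName>/keys/id_rsa_$1 /home/<LocalUserName>/keys/id_rsa_$1
    sudo chmod 600 /home/<LocalUserName>/keys/id_rsa_$1

    # Test the login to the container
    ssh -i /home/<LocalUserName>/keys/id_rsa_$1 ubuntu@<ServerIPAddress> -p $2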
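Typing the key path and port on every login gets tedious, so the connection details can be stored in ~/.ssh/config. The sketch below prints a suitable entry; ssh_config_entry and the lab- alias prefix are hypothetical, and the remote user ubuntu is taken from the scripts above.

```shell
# Print an ~/.ssh/config entry for one user's container.
# Arguments: user name, SSH port, server address, local key directory.
ssh_config_entry() {
    local user=$1 port=$2 server=$3 key_dir=$4
    cat <<EOF
Host lab-$user
    HostName $server
    Port $port
    User ubuntu
    IdentityFile $key_dir/id_rsa_$user
    IdentitiesOnly yes
EOF
}

# Usage (placeholders as in the script above):
#   ssh_config_entry alice 2222 <ServerIPAddress> /home/<LocalUserName>/keys >> ~/.ssh/config
#   ssh lab-alice
```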
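Server-side script for deleting a user:

    # Arguments: $1 = user name (container name), $2 = SSH port
    : "${1:?user name required}" "${2:?SSH port required}"
    echo "About to delete the configuration of container $1 (port $2)"

    # Remove exactly this user's key pair; a trailing wildcard would also
    # match other users whose names start with $1
    rm "/home/<UserName>/keys/id_rsa_$1" "/home/<UserName>/keys/id_rsa_$1.pub"
    sudo rm "/usr/local/shared-folder/keys/id_rsa_$1.pub"

    sudo lxc stop "$1"
    sudo lxc delete "$1"

    sudo ufw delete allow "$2"/tcp
    echo "Container deleted!"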