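Lab Server Virtual Environment Configuration Log

Use LXD virtualization to give the lab server isolated multi-user runtime environments.

Server monitoring: netdata setup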

bash <(curl -Ss https://my-netdata.io/kickstart.sh) --stable-channel --disable-telemetry
sudo apt-get install zlib1g-dev uuid-dev libuv1-dev liblz4-dev libjudy-dev libssl-dev libelf-dev libmnl-dev gcc make git autoconf autoconf-archive autogen automake pkg-config curl python cmake
  • Management commands for later use
sudo systemctl enable netdata # register as a system service
sudo systemctl start/stop/status netdata # start / stop / show status (pick one verb)
cat /etc/netdata/netdata.conf # view the configuration file
sudo sh /usr/libexec/netdata/netdata-uninstaller.sh # uninstall
sudo sh /usr/libexec/netdata/netdata-updater.sh # update (updates daily by default)

Container management: LXD setup

Install the container manager lxd, the zfs filesystem tools, and the network bridging tool bridge-utils

sudo snap install lxd
sudo apt install zfsutils-linux bridge-utils

Initialize lxd

sudo lxd init

# Prompts and answers used during initialization

Would you like to use LXD clustering? (yes/no) [default=no]: # single machine, no clustering
Do you want to configure a new storage pool? (yes/no) [default=yes]: # create a new zfs storage pool on the SSD
Name of the new storage pool [default=default]: lxd # name of the storage pool
Name of the storage backend to use (btrfs, ceph, dir, lvm, zfs) [default=zfs]: # zfs is the best choice here
Create a new ZFS pool? (yes/no) [default=yes]: # confirm
Would you like to use an existing block device? (yes/no) [default=no]: # tried yes, it did not work, so stay with no
Size in GB of the new loop device (1GB minimum) [default=100GB]: 360GB # slightly below the SSD's total capacity
Would you like to connect to a MAAS server? (yes/no) [default=no]: # no
Would you like to create a new local network bridge? (yes/no) [default=yes]: # bridge between the containers and the host
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]: # no
Name of the existing bridge or host interface: # name of the virtual bridge interface
Would you like LXD to be available over the network? (yes/no) [default=no]: # do not expose the containers; access them only through port mapping
Would you like stale cached images to be updated automatically? (yes/no) [default=yes] # refresh images periodically
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: # do not print the initial configuration
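
The same answers could also be supplied non-interactively. The sketch below assumes the preseed YAML layout accepted by `lxd init --preseed`; the function name print_lxd_preseed is hypothetical, and the `auto` address settings are assumptions rather than values taken from the session above. It passes one pool name to both the storage pool and the root disk of the default profile, since the two have to match.

```shell
# print_lxd_preseed POOL_NAME POOL_SIZE
# Prints a preseed document for a loop-backed zfs pool and a local bridge.
# Hypothetical helper; adjust the values before feeding it to lxd.
print_lxd_preseed() {
    cat <<EOF
config: {}
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: auto
storage_pools:
- name: $1
  driver: zfs
  config:
    size: $2
profiles:
- name: default
  devices:
    root:
      path: /
      pool: $1
      type: disk
    eth0:
      name: eth0
      nictype: bridged
      parent: lxdbr0
      type: nic
EOF
}

# Possible use (not executed here):
#   print_lxd_preseed lxd 360GB | sudo lxd init --preseed
```

Printing the document first makes it easy to review before it is applied.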

Configure lxd's default profile

sudo lxc profile edit default

Example configuration (YAML)

Note: the character device nvidia-uvm is essential for reaching the GPU driver from inside a container

config:
  limits.cpu: "32"
  limits.memory: 64GB
  security.nesting: "true"
  security.privileged: "true"
description: Default LXD profile
devices:
  eth0:
    limits.ingress: 200Mbit
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  gpu:
    type: gpu
  nvidia-uvm:
    path: /dev/nvidia-uvm
    type: unix-char
  root:
    path: /
    pool: default
    size: 30GB
    type: disk
  shared-folder:
    path: /usr/local/shared-folder
    source: /usr/local/shared-folder
    type: disk
name: default
used_by: []

Pull the latest Ubuntu 18.04 system image from the Tsinghua University (TUNA) mirror

sudo lxc remote add tuna-images https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
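sudo lxc image list tuna-images: # list the images available for download
sudo lxc launch tuna-images:<hash> <ContainerName> # pull the image from the TUNA mirror and create a local container

Install the container monitoring UI lxdui and register it as a system service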

sudo apt install -y git build-essential libssl-dev python3-venv python3-pip python3-dev zfsutils-linux bridge-utils
git clone https://github.com/AdaptiveScale/lxdui.git
cd lxdui

# Install as root below (if you do not mind the extra effort, create a dedicated user group to manage lxdui instead)

sudo su
pip3 install .
su <PreviousUserName>

Following a tutorial on registering Linux system services and the official lxdui reference unit file, create /etc/systemd/system/lxdui.service

[Unit]
Description=Web UI for the native Linux container technology LXD/LXC
After=network.target snapd.service
Requires=snapd.service

[Service]
Type=simple
Restart=always
RestartSec=10
User=root
PIDFile=/run/lxdui/lxdui.pid
ExecStart=/usr/local/bin/lxdui start
ExecStop=/usr/local/bin/lxdui stop

[Install]
WantedBy=multi-user.target

Start the service. Default username: admin, default password: admin, default port: 15151

sudo systemctl daemon-reload
sudo systemctl enable lxdui
sudo systemctl start lxdui
sudo ufw allow from <SubnetIPAddress>/<LengthOfSubnetMask> to any port 15151

Set up the GPU environment in the template container

Enter the container

sudo lxc exec <ContainerName> -- su --login <UserName>

Configure passwordless sudo

sudo su root
visudo

# Add the following line to the sudoers file

<UserName> ALL=(ALL) NOPASSWD: ALL

Following the official TUNA usage guide, switch the apt sources to the Tsinghua University mirror

sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo vi /etc/apt/sources.list

# New list of apt source URLs below

# Source mirrors are commented out by default to speed up apt update; uncomment them if needed
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-security main restricted universe multiverse

# Pre-release (proposed) repository, not recommended
# deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-proposed main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-proposed main restricted universe multiverse

Download the GPU driver, the CUDA 10.1 installer, and the cuDNN 7.6.5 archive

NVIDIA Driver Linux x86_64 Archive

CUDA Archive

cuDNN Archive

Following the official documentation, set up CUDA 10.1

Check the GPU hardware

sudo apt update
sudo apt install pciutils
lspci | grep -i nvidia

Check the Linux distribution

uname -m && cat /etc/*release

Check whether gcc is installed

gcc --version
sudo apt install build-essential manpages-dev # installs gcc, g++, make and the related man pages

Install the kernel headers required on Ubuntu

sudo apt-get install linux-headers-$(uname -r)

Install the GPU driver and CUDA 10.1 from the downloaded installers

sudo sh NVIDIA-Linux-x86_64-<DriverVersion>.run --no-kernel-module
sudo sh cuda_<CUDAVersion>_linux.run --no-drm

Add environment variables to ~/.bashrc

vi ~/.bashrc

# CUDA 10.1
export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

source ~/.bashrc
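
The `${VAR:+:${VAR}}` form used in these lines appends a colon and the old value only when the variable is already non-empty. That avoids a trailing empty entry, which both PATH and LD_LIBRARY_PATH interpret as the current directory. A minimal sketch of the behaviour, using a hypothetical helper named prepend_dir:

```shell
# prepend_dir DIR CURRENT
# Prints DIR, followed by ":CURRENT" only when CURRENT is non-empty.
# Hypothetical helper for illustration; it applies the same expansion as the
# lines added to ~/.bashrc.
prepend_dir() {
    dir=$1
    current=$2
    printf '%s\n' "${dir}${current:+:${current}}"
}

prepend_dir /usr/local/cuda-10.1/lib64 ""
# prints /usr/local/cuda-10.1/lib64

prepend_dir /usr/local/cuda-10.1/bin /usr/bin:/bin
# prints /usr/local/cuda-10.1/bin:/usr/bin:/bin
```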

Finally, check the driver version and the CUDA version

cat /proc/driver/nvidia/version
nvcc -V

[Optional] Build the sample code

cd <SampleRootPath>
cd 1_Utilities/deviceQuery
make
./deviceQuery

[Optional] Enable persistence mode

sudo /usr/bin/nvidia-persistenced --verbose

Following the official documentation, set up cuDNN 7.6.5

cd <CudnnPath>
tar -xzvf cudnn-<CudnnVersion>.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
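
To confirm which cuDNN version ended up under /usr/local/cuda, the version macros in the copied header can be read back. This is a sketch that assumes the cuDNN 7.x layout, where cudnn.h itself defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; the function name cudnn_version is hypothetical.

```shell
# cudnn_version [HEADER]
# Prints MAJOR.MINOR.PATCHLEVEL as defined in a cuDNN 7.x header.
# HEADER defaults to the path used by the copy commands above.
cudnn_version() {
    header=${1:-/usr/local/cuda/include/cudnn.h}
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$header"
}

# Possible use after the copy step (not executed here):
#   cudnn_version
```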

Set up the Python environment in the template container

Download the Python 3.7.9 source code

Python Source Code Download Page

Extract the archive and install dependencies

tar -xzvf Python-3.7.9.tgz
sudo apt update
sudo apt install -y gcc make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev

Configure the build options

cd Python-3.7.9
./configure --prefix=/usr/local/python3.7 --enable-optimizations

Build and install

sudo make -j 32 && sudo make install

Add environment variables

vi ~/.bashrc

# Python 3.7
PATH=/usr/local/python3.7/bin${PATH:+:${PATH}}
PATH=/home/ubuntu/.local/bin${PATH:+:${PATH}}

source ~/.bashrc
sudo chown -R <UserName> ~/.local

Configure pip to use the Tsinghua University mirror

python3.7 -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U --user
python3.7 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# The Aliyun mirror may be a little faster
python3.7 -m pip config set global.index-url https://mirrors.aliyun.com/pypi/simple

Example: installing packages with pip in user space (do not use sudo unless it is necessary)

python3.7 -m pip install numpy virtualenv --user --no-cache-dir

Example: setting up a virtual environment inside a project folder

python3.7 -m virtualenv ./.venv/

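# Activate the virtual environment
source ./.venv/bin/activate

# Run a program inside the virtual environment
python <PythonFileName>.py

# Install packages inside the virtual environment
python -m pip install numpy --no-cache-dir

# Leave the virtual environment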
deactivate

Install PyTorch and TensorFlow

python3.7 -m pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html --user --no-cache-dir
python3.7 -m pip install tensorflow-gpu==2.3.1 --user --no-cache-dir

Test whether the GPU is usable

import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())

import tensorflow as tf
print(tf.test.is_gpu_available())

One-command scripts for adding and removing users

Export the template container as an image

sudo lxc stop <ContainerName>
sudo lxc publish <ContainerName> --alias <NewImageName> --public

Server-side script for adding a user

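echo "User name: "$1
echo "SSH port for this user: "$2

# Launch a new container from the prepared image
sudo lxc launch <ImageName> $1

# Wait for the container's network to come up; otherwise openssh cannot be installed
sleep 10

# Mount the SSD data directory
if [ ! -d /mnt/data/${1} ] ; then
    mkdir /mnt/data/${1}/
fi
sudo lxc config device add $1 data disk source=/mnt/data/${1}/ path=/mnt/data
echo "SSD mounted successfully!"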

# Create the directories that hold the SSH key pairs
if [ ! -d /home/<UserName>/keys ] ; then
    mkdir /home/<UserName>/keys
fi

if [ ! -d /usr/local/shared-folder/keys ] ; then
    sudo mkdir /usr/local/shared-folder/keys
fi

# Install openssh inside the container
sudo lxc exec $1 -- sudo apt update
sudo lxc exec $1 -- sudo apt install -y openssh-server
sudo lxc exec $1 -- sudo service ssh start
echo "The ssh service inside the container has started!"

# Back up the original sshd configuration
sudo lxc exec $1 -- sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak

# Add write permission
sudo lxc exec $1 -- sudo chmod a+w /etc/ssh/sshd_config

# Write the new sshd configuration
sshd_config="Port ${2}
AddressFamily any
ListenAddress 0.0.0.0
ListenAddress ::
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
PasswordAuthentication no
Subsystem sftp /usr/lib/openssh/sftp-server
AllowUsers ubuntu
UsePAM yes"

sudo lxc exec $1 -- sh -c "cat>/etc/ssh/sshd_config<<EOF
${sshd_config}
EOF"

# Restore the file permissions
sudo lxc exec $1 -- sudo chmod 644 /etc/ssh/sshd_config
echo "New sshd configuration file generated!"

# Generate an RSA-2048 key pair on the host
ssh-keygen -b 2048 -t rsa -f /home/<UserName>/keys/id_rsa_$1 -q -N ""
echo "New key pair generated!"

# Copy the public key to the shared folder
sudo chmod a+r /home/<UserName>/keys/id_rsa_$1
sudo cp /home/<UserName>/keys/id_rsa_$1.pub /usr/local/shared-folder/keys/id_rsa_$1.pub
echo "Public key written to the shared folder!"

# Write the public key into the .ssh directory
sudo lxc exec $1 -- sudo mkdir -p /home/ubuntu/.ssh
sudo lxc exec $1 -- sudo touch /home/ubuntu/.ssh/authorized_keys
sudo lxc exec $1 -- sudo chmod a+w /home/ubuntu/.ssh/authorized_keys
sudo lxc exec $1 -- sh -c "sudo cat /usr/local/shared-folder/keys/id_rsa_$1.pub >> /home/ubuntu/.ssh/authorized_keys"

# Restore the owner and permissions of the .ssh directory
sudo lxc exec $1 -- sudo chown -R ubuntu /home/ubuntu/.ssh
sudo lxc exec $1 -- sudo chmod 700 /home/ubuntu/.ssh
sudo lxc exec $1 -- sudo chmod 644 /home/ubuntu/.ssh/authorized_keys
echo "Public key written to the .ssh directory!"

# Restart the ssh service
sudo lxc exec $1 -- sudo service ssh restart

# Port mapping
sudo lxc config device add $1 port-ssh proxy listen=tcp:0.0.0.0:$2 connect=tcp:127.0.0.1:$2
echo "ssh service and port mapping for the container are configured!"

# Open the port in the firewall
sudo ufw allow $2/tcp
echo "Firewall port $2 is now open!"
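
The fixed `sleep 10` in the script is a guess at how long the container's network needs. One alternative is to poll until a check succeeds or a timeout expires. The sketch below is not part of the original script; the name wait_until is hypothetical, and the DNS lookup shown in the usage comment is only one assumed way to detect that the network is up.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds (returns 0) or
# TIMEOUT_SECONDS have passed (returns 1).
# Note: wait_timeout and wait_elapsed are plain global shell variables.
wait_until() {
    wait_timeout=$1
    shift
    wait_elapsed=0
    until "$@"; do
        if [ "$wait_elapsed" -ge "$wait_timeout" ]; then
            return 1
        fi
        sleep 1
        wait_elapsed=$((wait_elapsed + 1))
    done
}

# Possible replacement for "sleep 10" (not executed here):
#   wait_until 60 sudo lxc exec "$1" -- getent hosts mirrors.tuna.tsinghua.edu.cn
```

Polling returns as soon as the container is ready and fails visibly when it never becomes ready, instead of continuing into a broken apt install.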

Admin-PC-side script for adding a user

echo "User name: "$1
echo "SSH port for this user: "$2

# Download the private key from the remote server
scp -P <PortNumber> <UserName>@<ServerIPAddress>:/home/<UserName>/keys/id_rsa_$1 /home/<LocalUserName>/keys/id_rsa_$1

# The private key must have permission 600, do not forget :)
sudo chmod 600 /home/<LocalUserName>/keys/id_rsa_$1

# Try to connect
ssh -i /home/<LocalUserName>/keys/id_rsa_$1 ubuntu@<ServerIPAddress> -p $2
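
Once the key works, the long ssh command can be shortened with an entry in ~/.ssh/config. The sketch below only prints such an entry. The function name print_ssh_host_entry and the example address are hypothetical; the user ubuntu comes from the AllowUsers line in the server-side script.

```shell
# print_ssh_host_entry ALIAS SERVER_ADDRESS PORT KEY_FILE
# Prints an ssh_config Host block for one container user.
print_ssh_host_entry() {
    cat <<EOF
Host $1
    HostName $2
    Port $3
    User ubuntu
    IdentityFile $4
    IdentitiesOnly yes
EOF
}

# Possible use (not executed here; 192.0.2.10 is a placeholder address):
#   print_ssh_host_entry alice 192.0.2.10 2222 ~/keys/id_rsa_alice >> ~/.ssh/config
#   ssh alice
```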

Server-side script for removing a user

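echo "About to delete container $1 and its configuration, port $2"

# Delete the locally stored keys
rm /home/<UserName>/keys/id_rsa_$1*
sudo rm /usr/local/shared-folder/keys/id_rsa_$1*

# Stop the container
sudo lxc stop $1

# Delete the container
sudo lxc delete $1

# Delete the firewall rule
sudo ufw delete allow $2/tcp

echo "Container deleted!"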
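
Both server-side scripts use $1 as the container name and $2 as the port without checking them, so a missing or mistyped argument reaches rm, lxc and ufw unchanged. A minimal sketch of a guard that could sit at the top of either script follows. The function names are hypothetical, and the port range 1024-65535 is an assumption (unprivileged ports only). The name rule follows the usual LXD instance-name constraints: 1 to 63 ASCII letters, digits or dashes, not starting with a digit or dash, not ending with a dash.

```shell
# is_valid_port PORT
# Succeeds when PORT is a decimal integer in 1024..65535 without leading zeros.
is_valid_port() {
    case $1 in
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    [ "${#1}" -le 5 ] && [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

# is_valid_container_name NAME
# Succeeds when NAME has 1 to 63 characters, uses only ASCII letters, digits
# and dashes, starts with a letter, and does not end with a dash.
is_valid_container_name() {
    case $1 in
        ''|[!A-Za-z]*|*-|*[!A-Za-z0-9-]*) return 1 ;;
    esac
    [ "${#1}" -le 63 ]
}

# Possible use at the top of the add-user or remove-user script (not executed here):
#   is_valid_container_name "$1" || { echo "invalid container name: $1" >&2; exit 1; }
#   is_valid_port "$2" || { echo "invalid port: $2" >&2; exit 1; }
```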