My Love-Hate Saga with the 3DGNN Paper and Its Source Code

2019-03-23 2019-11-02

Setting up the environment for this paper's source code was painful.
Far too painful.
What kind of hardcore person runs a MATLAB + Caffe environment?!

Reference paper

3D Graph Neural Networks for RGBD Semantic Segmentation
The source code is in the 3DGNN repository.

Environment already in place:

Ubuntu 16.04(64bit)
cuda: 8.0
cudnn: 5.1
OpenCV: 3.4.3
Matlab: R2016b
GPU: GeForce GTX 1080 Ti x2

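Installing MATLAB

My server has no graphical interface, so MATLAB has to be installed in silent mode (mode=silent).
For the same reason, the toolboxes MATLAB needs can't be downloaded over the network during installation, so the ISO image has to be mounted instead.

Before running make matcaffe, there is a C++ runtime version conflict to resolve: MATLAB bundles its own libstdc++, which may not match the one the system compiler uses.

In the install path xxx/MATLAB/R2016b/sys/os/glnxa64,
rename libstdc++.so.6 to libstdc++.so.6_back:

sudo mv libstdc++.so.6 libstdc++.so.6_back
sudo ldconfig

MATLAB can then no longer find its bundled libstdc++.so.6 and falls back to the system copy (on this Ubuntu 16.04 setup, under /usr/lib/x86_64-linux-gnu).

Another way to handle this is to create a soft link (a redirect for the library: anything that looks up this library name is pointed at the library you specify):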

cd /usr/local/MATLAB/R2016b/sys/os/glnxa64
sudo rm libstdc++.so.6
sudo ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 libstdc++.so.6
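
The two approaches above (rename as a backup, or replace with a soft link) can be combined into one step. The following is only a sketch: relink_lib is a hypothetical helper, not something shipped with MATLAB or Caffe. It keeps the bundled file as <name>_back the first time it runs, then points the original name at the system library.

```shell
# Hypothetical helper (not part of MATLAB or Caffe): replace a bundled library
# with a symlink to another copy, keeping the original as <name>_back.
relink_lib() {
    local dir="$1" name="$2" target="$3"
    if [ ! -e "$target" ]; then
        echo "target not found: $target" >&2
        return 1
    fi
    # Back up only a real file (not an existing symlink), and only if no
    # backup exists yet, so a rerun does not overwrite the first backup.
    if [ -e "$dir/$name" ] && [ ! -L "$dir/$name" ] && [ ! -e "$dir/${name}_back" ]; then
        mv "$dir/$name" "$dir/${name}_back"
    fi
    # -f replaces whatever is currently at $dir/$name with the new symlink.
    ln -sfn "$target" "$dir/$name"
}

# Example with the paths from this post (needs sudo if the MATLAB tree is root-owned):
# relink_lib /usr/local/MATLAB/R2016b/sys/os/glnxa64 libstdc++.so.6 \
#     /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
```

Restoring MATLAB's own copy is then a matter of deleting the symlink and renaming the _back file.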

Installing Caffe

make all -j8
make test -j8
make runtest -j8

Error: hdf5 not found

fatal error: hdf5.h: No such file or directory

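Solution 1
In Makefile.config in the caffe folder,
append /usr/include/hdf5/serial/ to the end of INCLUDE_DIRS :=
append /usr/lib/x86_64-linux-gnu/hdf5/serial/ to the end of LIBRARY_DIRS :=
Save and rebuild.
Solution 2 (if you have root access)
Add symlinks (adjust the version suffix to match the files actually present on your system):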
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libhdf5_serial.so.8.0.2 libhdf5.so
sudo ln -s libhdf5_serial_hl.so.8.0.2 libhdf5_hl.so
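
Solution 1 can also be scripted rather than edited by hand. This is a rough sketch, assuming Makefile.config keeps each variable on a single line of the form "INCLUDE_DIRS := ..." as in Caffe's example config; append_dir is a hypothetical helper name.

```shell
# Hypothetical helper: append a directory to a single-line "VAR := ..." entry.
append_dir() {
    local file="$1" var="$2" dir="$3"
    # Do nothing if the directory is already listed, so reruns are harmless.
    if grep "^${var} :=" "$file" | tr ' \t' '\n\n' | grep -qxF "$dir"; then
        return 0
    fi
    # Append " <dir>" to the end of the matching line; write via a temp file
    # instead of relying on sed -i, whose syntax differs between systems.
    sed "/^${var} :=/ s|\$| ${dir}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example, run from the caffe folder:
# append_dir Makefile.config INCLUDE_DIRS /usr/include/hdf5/serial/
# append_dir Makefile.config LIBRARY_DIRS /usr/lib/x86_64-linux-gnu/hdf5/serial/
```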

Error: cannot find -lhdf5

LD -o .build_release/lib/libcaffe.so.1.0.0-rc3
/usr/bin/ld: cannot find -lhdf5
collect2: error: ld returned 1 exit status
Makefile:563: recipe for target ‘.build_release/lib/libcaffe.so.1.0.0-rc3’ failed
make: *** [.build_release/lib/libcaffe.so.1.0.0-rc3] Error 1

Solution:
In the Makefile, change the LIBRARIES line to
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_serial_hl hdf5_serial

Error: cuDNN API mismatch

In file included from ./include/caffe/util/device_alternate.hpp:40:0,
from ./include/caffe/common.hpp:19,
from src/caffe/common.cpp:7:
./include/caffe/util/cudnn.hpp: In function ‘void caffe::cudnn::createPoolingDesc(cudnnPoolingStruct, caffe::PoolingParameter_PoolMethod, cudnnPoolingMode_t, int, int, int, int, int, int)’:
./include/caffe/util/cudnn.hpp:127:41: error: too few arguments to function ‘cudnnStatus_t cudnnSetPooling2dDescriptor(cudnnPoolingDescriptor_t, cudnnPoolingMode_t, cudnnNanPropagation_t, int, int, int, int, int, int)’
pad_h, pad_w, stride_h, stride_w));
^
./include/caffe/util/cudnn.hpp:15:28: note: in definition of macro ‘CUDNN_CHECK’
cudnnStatus_t status = condition; \
^
In file included from ./include/caffe/util/cudnn.hpp:5:0,
from ./include/caffe/util/device_alternate.hpp:40,
from ./include/caffe/common.hpp:19,
from src/caffe/common.cpp:7:
/usr/local/cuda-7.5//include/cudnn.h:803:27: note: declared here
cudnnStatus_t CUDNNWINAPI cudnnSetPooling2dDescriptor(
^
make: *** [.build_release/src/caffe/common.o] Error 1

Reference: https://blog.csdn.net/u011070171/article/details/52292680

This happens because the cuDNN wrapper code in this version of Caffe doesn't match the cuDNN version installed on the system.

Solution (if you're worried about slipping up, remember to make a backup first):

git clone the latest Caffe source.
Replace ./include/caffe/util/cudnn.hpp with the cudnn.hpp from the latest Caffe.
Replace every file in ./include/caffe/layers whose name starts with cudnn, e.g. cudnn_conv_layer.hpp, with the same-named file from the latest Caffe.
Replace every file in ./src/caffe/layers whose name starts with cudnn and ends in .cpp or .cu, e.g. cudnn_lrn_layer.cu, cudnn_pooling_layer.cpp, cudnn_sigmoid_layer.cu, with the same-named file from the latest Caffe.
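
The replacement steps above can be sketched as a small script. sync_cudnn is a hypothetical helper, and the two directory arguments (a fresh Caffe checkout and the 3DGNN Caffe tree) are placeholders. It only touches files that already exist in the old tree, and it leaves a .bak copy next to each one it overwrites.

```shell
# Hypothetical helper: overwrite the cuDNN wrappers in an older Caffe tree with
# the same-named files from a newer checkout, keeping a .bak of each original.
sync_cudnn() {
    local new="$1" old="$2" f rel
    for f in "$new"/include/caffe/util/cudnn.hpp \
             "$new"/include/caffe/layers/cudnn_*.hpp \
             "$new"/src/caffe/layers/cudnn_*.cpp \
             "$new"/src/caffe/layers/cudnn_*.cu; do
        [ -f "$f" ] || continue            # unmatched glob or missing file
        rel="${f#"$new"/}"                 # path relative to the checkout root
        [ -f "$old/$rel" ] || continue     # only replace same-named files
        cp "$old/$rel" "$old/$rel.bak"
        cp "$f" "$old/$rel"
        echo "updated $rel"
    done
}

# Example (both paths are placeholders):
# sync_cudnn /path/to/latest/caffe /path/to/3DGNN/caffe
```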

Error: cannot find -lopencv_dep_cudart

/usr/bin/ld: cannot find -lopencv_dep_cudart
collect2: error: ld returned 1 exit status
src/caffe/CMakeFiles/caffe.dir/build.make:4285: recipe for target ‘lib/libcaffe.so.1.0.0-rc3’ failed
make[2]: *** [lib/libcaffe.so.1.0.0-rc3] Error 1
CMakeFiles/Makefile2:272: recipe for target ‘src/caffe/CMakeFiles/caffe.dir/all’ failed
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
Makefile:127: recipe for target ‘all’ failed
make: *** [all] Error 2

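Solution:
Turn CUDA_USE_STATIC_CUDA_RUNTIME off, then run make again. The log above comes from a CMake build, where this is a CMake option, so it is normally passed at configure time as -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF. (A bare set NAME=value in bash does not define a variable.)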

Error: libcaffe.so fails during make with undefined reference to cv::

.build_release/lib/libcaffe.so: undefined reference to `cv::_InputArray::_InputArray(cv::Mat const&)’

.build_release/lib/libcaffe.so: undefined reference to `cv::imdecode(cv::_InputArray const&, int)’

.build_release/lib/libcaffe.so: undefined reference to `cv::imencode(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cv::_InputArray const&, std::vector<unsigned char, std::allocator<unsigned char> >&, std::vector<int, std::allocator<int> > const&)’

.build_release/lib/libcaffe.so: undefined reference to `CvKNearest::CvKNearest(CvMat const*, CvMat const*, CvMat const*, bool, int)’

……………………………….

collect2: error: ld returned 1 exit status

Makefile:616: recipe for target ‘.build_release/tools/compute_image_mean.bin’ failed

make: *** [.build_release/tools/compute_image_mean.bin] Error 1

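This happens because Caffe couldn't find the OpenCV libraries.

Possible causes:

1. You are using OpenCV 3
In Makefile.config, uncomment OPENCV_VERSION := 3,
and comment out USE_PKG_CONFIG := 1.
2. The library path wasn't found
Use ldd .build_release/lib/libcaffe.so to list the shared libraries libcaffe.so links against.
If you see libxxx.so => not found, either the library wasn't found or cause 3 applies.
First add the library path after LIBRARY_DIRS := ... in Makefile.config,
then add the library you need after LIBRARIES += ... in the Makefile.
3. The wrong library path was picked
The Makefile searches for libraries in the order given by LIBRARY_DIRS in Makefile.config.
If an earlier path contains the library you need, later paths are never searched.
Again, use ldd .build_release/lib/libcaffe.so to list the shared libraries libcaffe.so links against and check whether the paths are right.
If a path is wrong, reorder LIBRARY_DIRS in Makefile.config.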
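
For causes 2 and 3, the ldd output can be filtered instead of read by eye. A minimal sketch follows; missing_libs and lib_paths are hypothetical helper names, and both read ldd output on stdin so they work on any binary.

```shell
# Hypothetical helper: print the names of libraries ldd reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical helper: for libraries whose name contains the given text,
# print "name path" so you can see which copy was actually resolved.
lib_paths() {
    awk -v pat="$1" '$2 == "=>" && $3 != "not" && index($1, pat) { print $1, $3 }'
}

# Examples, run from the caffe folder:
# ldd .build_release/lib/libcaffe.so | missing_libs
# ldd .build_release/lib/libcaffe.so | lib_paths opencv
```

If lib_paths shows an OpenCV library resolving to an unexpected directory, that points at cause 3 rather than cause 2.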

Error: make runtest fails with Error in `xxx/test/test.testbin': free(): invalid pointer

If running the Caffe tests produces the following error
*** Error in `xxx/test/test.testbin': free(): invalid pointer: 0x00007fbaf71accb8 ***

the libtcmalloc-minimal4 library is most likely missing. Install it first:
sudo apt-get install libtcmalloc-minimal4

Then open ~/.bashrc
vim ~/.bashrc

and add the following line at the end of the file:
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

Finally, reload the environment:
source ~/.bashrc
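
A hard-coded LD_PRELOAD path makes every command print a loader warning if the file is not there. The sketch below guards against that; first_existing and preload_tcmalloc are hypothetical helpers, and the second candidate path in the example is an assumption about where other Ubuntu releases may place the library.

```shell
# Hypothetical helper: print the first argument that is an existing file.
first_existing() {
    local p
    for p in "$@"; do
        if [ -f "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: export LD_PRELOAD only if one of the candidates exists.
preload_tcmalloc() {
    local lib
    if lib=$(first_existing "$@"); then
        export LD_PRELOAD="$lib"
    else
        echo "libtcmalloc_minimal not found; LD_PRELOAD left unchanged" >&2
        return 1
    fi
}

# Example for ~/.bashrc (first path is the one used in this post):
# preload_tcmalloc /usr/lib/libtcmalloc_minimal.so.4 \
#     /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
```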

Building and testing the matcaffe interface

Building matcaffe

make matcaffe

Set up the OpenCV symlinks.

If you test in MATLAB straight after make matcaffe, you may hit the following error:

Invalid MEX-file '/home/dong/caffe/matlab/+caffe/private/caffe_.mexa64':
/home/dong/caffe/matlab/+caffe/private/caffe_.mexa64: undefined symbol:
_ZN2cv8imencodeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_11_InputArrayERSt6vectorIhSaIhEERKSB_IiSaIiEE

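This is because the OpenCV version I built Caffe against (2.4.13) differs from the one bundled with MATLAB (2.4.9), so the symbol can't be resolved.

The libraries need to be redirected (backing up the originals along the way):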

cd /xxx/MATLAB/R2016b/bin/glnxa64
sudo mv libopencv_core.so.2.4 libopencv_core.so.2.4_back
sudo mv libopencv_highgui.so.2.4 libopencv_highgui.so.2.4_back
sudo mv libopencv_imgproc.so.2.4 libopencv_imgproc.so.2.4_back
sudo ln -s /usr/lib/x86_64-linux-gnu/libopencv_core.so.2.4.13 libopencv_core.so.2.4
sudo ln -s /usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4.13 libopencv_highgui.so.2.4
sudo ln -s /usr/lib/x86_64-linux-gnu/libopencv_imgproc.so.2.4.13 libopencv_imgproc.so.2.4

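Testing matcaffe

Before running make mattest, the same libstdc++ version conflict has to be resolved.

Method 1
In /usr/local/MATLAB/R2016b/sys/os/glnxa64,
rename libstdc++.so.6 to libstdc++.so.6_back:
sudo mv libstdc++.so.6 libstdc++.so.6_back
Once MATLAB can't find its own libstdc++.so.6, it falls back to the libstdc++ version the system uses.

Method 2
Another way to handle this is to create a soft link (a redirect for the library: anything that looks up this library name is pointed at the library you specify):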
cd /usr/local/MATLAB/R2016b/sys/os/glnxa64
sudo rm libstdc++.so.6
sudo ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 libstdc++.so.6

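Launching MATLAB

Note: the matlab folder generated under the Caffe path should contain a +caffe folder; the function implementations and files matcaffe needs are all in there.
Run addpath('/your_caffe_path/matlab') first so that MATLAB can find matcaffe.
In nyu_crop_data_mask_msc.m, find gpu_id and set it to the id of an idle GPU on your machine.

In a terminal, the nvidia-smi command shows GPU information.

Its output looks like this: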

| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 33%   32C    P0    54W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 33%   35C    P0    50W / 250W |      0MiB / 11170MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

My GPUs are numbered 0 and 1, so gpu_id can be set to 0 or 1.

(Note: matcaffe can only train a network on a single GPU.)
Then run nyu_crop_data_mask_msc.m,
and it works.
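
Choosing an idle GPU can be done from the shell rather than by reading the table. A small sketch, assuming "idle" means "least memory in use"; least_used_gpu is a hypothetical helper that reads "index, memory.used" rows, which is the format nvidia-smi emits with the query options shown in the comment.

```shell
# Hypothetical helper: given "index, memory.used" CSV rows on stdin, print the
# index of the GPU with the least memory in use (lowest index wins a tie).
least_used_gpu() {
    sort -t, -k2,2n | head -n 1 | cut -d, -f1
}

# Example (needs an NVIDIA driver installed):
# nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | least_used_gpu
```

The printed index is the value to put in gpu_id.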

Running the MATLAB script

Error: Unknown pooling method

I0331 20:04:23.127297 49968 net.cpp:106] Creating Layer x_1_avepool
I0331 20:04:23.127324 49968 net.cpp:454] x_1_avepool <- out_reduce_out_reduce_relu_0_split_0
I0331 20:04:23.127333 49968 net.cpp:454] x_1_avepool <- knn_knn_0_split_0
I0331 20:04:23.127346 49968 net.cpp:411] x_1_avepool -> x_1_avepool
F0331 20:04:23.127763 49968 cudnn.hpp:144] Unknown pooling method.
*** Check failure stack trace: ***

This happens because the source code does not use cuDNN.
Looking back through the output, you can find the definition of the layer named x_1_avepool:

layer {
  name: "x_1_avepool"
  type: "Pooling"
  bottom: "out_reduce"
  bottom: "knn"
  top: "x_1_avepool"
  pooling_param {
    pool: KNNPOOL
    kernel_size: 11
    pad: 5
  }
}

The layer uses KNNPOOL, a pooling method that cudnn.hpp in util does not define.
Comment out USE_CUDNN in Makefile.config and rebuild.
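
Commenting out the option can be scripted as below. This is a sketch that assumes the option appears as an active "USE_CUDNN := 1" line, as in Caffe's example config; disable_cudnn is a hypothetical helper name.

```shell
# Hypothetical helper: comment out an active USE_CUDNN line in Makefile.config.
disable_cudnn() {
    local file="$1"
    # Lines that already start with '#' do not match, so reruns change nothing.
    sed 's/^[[:space:]]*USE_CUDNN[[:space:]]*:=/# &/' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Example, run from the caffe folder:
# disable_cudnn Makefile.config
```

Objects already compiled with cuDNN enabled are probably best rebuilt from scratch (make clean before make all).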

CMake

Error: make runtest fails in the CMake build

src/caffe/test/test_gradient_based_solver.cpp:370: Failure
The difference between expected_updated_weight and solver_updated_weight is 1.1920928955078125e-07, which exceeds error_margin, where
expected_updated_weight evaluates to 9.6857547760009766e-06,
solver_updated_weight evaluates to 9.8049640655517578e-06, and
error_margin evaluates to 1.0000000116860974e-07.
[ FAILED ] NesterovSolverTest/2.TestNesterovLeastSquaresUpdateWithEverythingShare, where TypeParam = caffe::GPUDevice (8073 ms)
[ RUN ] NesterovSolverTest/2.TestLeastSquaresUpdateWithEverythingAccumShare
[ OK ] NesterovSolverTest/2.TestLeastSquaresUpdateWithEverythingAccumShare (28 ms)
[ RUN ] NesterovSolverTest/2.TestNesterovLeastSquaresUpdateWithEverything
src/caffe/test/test_gradient_based_solver.cpp:370: Failure
The difference between expected_updated_weight and solver_updated_weight is 1.1920928955078125e-07, which exceeds error_margin, where
expected_updated_weight evaluates to 9.6857547760009766e-06,
solver_updated_weight evaluates to 9.8049640655517578e-06, and
error_margin evaluates to 1.0000000116860974e-07.
[ FAILED ] NesterovSolverTest/2.TestNesterovLeastSquaresUpdateWithEverything, where TypeParam = caffe::GPUDevice (7338 ms)

Before make runtest, run export CUDA_VISIBLE_DEVICES=0 so that the tests see only GPU 0.

Then make runtest succeeds.
