Last year I fought my way through EasyRec; I didn't expect to need recommendation tooling again this year. Fine, one more round, stepping on the same pits again.
1、Downloading the EasyRec training/test data:
After git clone, cd into EasyRec and run: bash scripts/init.sh — this downloads all the required data ✅
2、Model deployment (see the earlier post on deploying tf-serving with docker):
First, take the final directory and copy everything under it into /models/half_plus_two/.
2.1 Inspect the model's basic parameters:
saved_model_cli show --dir /models/half_plus_two/00000123/ --tag_set serve --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['x'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: x:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['y'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: y:0
Method name is: tensorflow/serving/predict
2.2 Check whether the model supports GPU
saved_model_cli show --dir /models/half_plus_two/00000123/
The given SavedModel contains the following tag-sets:
serve  # only `serve`, so no GPU support; a GPU-capable model would show `serve, gpu`
2.3 Run the model on sample input and check the result
saved_model_cli run --dir /models/half_plus_two/00000123/ --tag_set serve --signature_def serving_default --input_exprs="x=[[1],[9]]"
[[2.5]
[6.5]]
The result is correct: "half plus two" means 1*0.5+2=2.5 and 9*0.5+2=6.5, which matches the output.
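The same check can be done from Python instead of saved_model_cli. A minimal sketch: build the REST payload matching the `(-1, 1)` shape of input `x` shown above, and verify the half-plus-two formula locally (the commented-out URL assumes the container is exposed on port 8501, as in the docker setup below):

```python
import json

# The half_plus_two model computes y = 0.5 * x + 2 for each input row.
def half_plus_two(x):
    return 0.5 * x + 2

# REST payload matching the signature above: input 'x' with shape (-1, 1).
payload = {"instances": [[1.0], [9.0]]}
body = json.dumps(payload)

# Expected predictions, computed locally with the same formula.
expected = [[half_plus_two(v[0])] for v in payload["instances"]]
print(body)      # {"instances": [[1.0], [9.0]]}
print(expected)  # [[2.5], [6.5]]

# To hit a running container (assuming port 8501, as in this post):
# import requests
# r = requests.post("http://localhost:8501/v1/models/half_plus_two:predict", data=body)
```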
2.4 curl: (56) Recv failure: Connection reset by peer
The port passed to docker -p did not work here, even though specifying a custom port worked fine on the previous server — something fishy. I switched back to port 8501.
The docker startup command is in that post.
2.5 Stopping and removing the container — without this step you cannot restart a container under the same name.
Note that there is both a model name and a container name; I recommend using the same value for both, e.g. half_plus_two:
docker kill half_plus_two
docker rm half_plus_two
3、Deploying the trained DSSM model
3.1 Inspect the model's input parameters
saved_model_cli show --dir /models/mydssm/163333/ --tag_set serve --signature_def serving_default
The given SavedModel SignatureDef contains the following input(s):
  inputs['app_category'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_9:0
  inputs['app_domain'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_8:0
  inputs['app_id'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_7:0
  inputs['banner_pos'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_3:0
  inputs['c1'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_2:0
  inputs['c14'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_15:0
  inputs['c15'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_16:0
  inputs['c16'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_17:0
  inputs['c17'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_18:0
  inputs['c18'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_19:0
  inputs['c19'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_20:0
  inputs['c20'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_21:0
  inputs['c21'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_22:0
  inputs['device_conn_type'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_14:0
  inputs['device_id'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_10:0
  inputs['device_ip'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_11:0
  inputs['device_model'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_12:0
  inputs['device_type'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_13:0
  inputs['hour'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_1:0
  inputs['site_category'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_6:0
  inputs['site_domain'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_5:0
  inputs['site_id'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: input_4:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: Squeeze:0
  outputs['probs'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: Sigmoid:0
Method name is: tensorflow/serving/predict
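All 22 inputs are DT_STRING with shape (-1), i.e. one string per example per feature, so every request must send each feature as a string. A small sketch (the feature list is copied from the SignatureDef above) that builds one REST `instances` entry and fails fast on missing features before the request goes out:

```python
# Feature names copied from the SignatureDef above; all are DT_STRING.
FEATURES = [
    "app_category", "app_domain", "app_id", "banner_pos", "c1",
    "c14", "c15", "c16", "c17", "c18", "c19", "c20", "c21",
    "device_conn_type", "device_id", "device_ip", "device_model",
    "device_type", "hour", "site_category", "site_domain", "site_id",
]

def make_instance(raw: dict) -> dict:
    """Build one REST 'instances' entry, coercing every value to str
    and raising on missing features."""
    missing = [f for f in FEATURES if f not in raw]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return {f: str(raw[f]) for f in FEATURES}

# The same dummy values "1".."22" used in the curl test in this post.
instance = make_instance({f: i + 1 for i, f in enumerate(FEATURES)})
print(instance["app_category"])  # prints 1
```

Coercing with `str()` matters because sending a bare integer for a DT_STRING input makes tf-serving reject the request.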
# Likewise, this model does not support GPU:
saved_model_cli show --dir /models/mydssm/13339343/
The given SavedModel contains the following tag-sets:
serve
3.2 Test the request
curl -d '{"instances": [{"app_category":"1","app_domain":"2","app_id":"3","banner_pos":"4","c1":"5","c14":"6","c15":"7","c16":"8","c17":"9","c18":"10","c19":"11","c20":"12","c21":"13","device_conn_type":"14","device_id":"15","device_ip":"16","device_model":"17","device_type":"18","hour":"19","site_category":"20","site_domain":"21","site_id":"22"}]}' -X POST http://localhost:8501/v1/models/mydssm:predict
{
    "predictions": [
        {
            "logits": -2.64485741,
            "probs": 0.0663066804
        }
    ]
}
#https://github.com/tensorflow/serving/issues/2104
>>> import json, requests
>>> url = "http://localhost:8501/v1/models/mydssm:predict"
>>> heads = {"content-type": "application/json"}
>>> jd = {"signature_name": "serving_default","instances":[{"app_category":"1","app_domain":"2","app_id":"3","banner_pos":"4","c1":"5","c14":"6","c15":"7","c16":"8","c17":"9","c18":"10","c19":"11","c20":"12","c21":"13","device_conn_type":"14","device_id":"15","device_ip":"16","device_model":"17","device_type":"18","hour":"19","site_category":"20","site_domain":"21","site_id":"22"}]}
>>> requests.post(url, data=json.dumps(jd), headers=heads).json()
{'predictions': [{'logits': -2.64485741, 'probs': 0.0663066804}]}
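The two outputs are consistent with each other: `probs` is just the sigmoid of `logits` (the output tensor is literally named Sigmoid:0 in the signature above). A quick local check against the values the server returned:

```python
import math

def sigmoid(z: float) -> float:
    # Plain formulation; fine for the magnitudes seen here.
    return 1.0 / (1.0 + math.exp(-z))

logits = -2.64485741
probs = sigmoid(logits)
print(probs)  # ~0.0663066804, matching the 'probs' field returned above
```

This is a handy sanity check when you only log one of the two fields downstream.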
4、Using GPU with tf-serving requires the GPU docker image
4.1 Pull the image
docker pull tensorflow/serving:latest-gpu
4.2 Install the NVIDIA container toolkit
CentOS:
sudo dnf clean expire-cache \
&& sudo dnf install -y nvidia-container-toolkit-base
nvidia-ctk --version
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
grep " name:" /etc/cdi/nvidia.yaml
Ubuntu:
sudo apt-get update \
&& sudo apt-get install -y nvidia-container-toolkit-base
nvidia-ctk --version
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
grep " name:" /etc/cdi/nvidia.yaml
4.3 Start the container
docker run --runtime=nvidia -p 8501:8501 \
  --mount type=bind,source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two \
  -t tensorflow/serving:latest-gpu \
  --per_process_gpu_memory_fraction=0.5
Verify that docker can actually see the GPU:
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
To wrap up: for anything not covered here, see the official NVIDIA docs and the tf-serving GitHub repo.
Bye!