Face detection with the Inference Engine

2020/05/28

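This is the seventh article in the series "Getting Started Right Away with Deep Learning Inference Using the AI CORE X Starter Kit and OpenVINO™".

The series starts with "What is deep learning?" and works step by step through "How to use the tools", "Programming basics", and "Applied and practical programming". The goal is for you to understand the material well enough to build your own original AI applications.

In this seventh article, we learn how to run deep learning inference that detects the positions of faces in an input image.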

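Preparing the required files

As in the previous article, prepare the two kinds of files needed for inference: the "input image" that inference runs on, and the "model and weights".

Input image

Since we are detecting faces this time, prepare an image that contains a face. Last time we cropped the image, but this time we use the image from this page as it is. The download button is near the bottom of the page. I used size S.

Rename the file to photo.jpg.
As before, put it in the image folder.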

Model and weights

Open a terminal and move to the directory that contains downloader.py.

cd /opt/intel/openvino/deployment_tools/tools/model_downloader

This time, download face-detection-retail-0005.

python3 downloader.py --name face-detection-retail-0005 --output_dir ~/workspace

The face-detection-retail-0005 folder has now been downloaded under ~/workspace/intel.

Running deep learning inference

Now that everything is ready, create a file directly under the workspace folder, paste in the code below, and run it.

# import
import cv2
import numpy as np
from openvino.inference_engine import IENetwork, IEPlugin

# Specify the target device
plugin = IEPlugin(device='MYRIAD')

# Load the model
net = IENetwork(model='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.xml', weights='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.bin')
exec_net = plugin.load(network=net)

# Get the keys of the input and output data
input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))

# Load the image
frame = cv2.imread('image/photo.jpg')

# Convert to the input data format
img = cv2.resize(frame, (300, 300)) # Change Height and Width
img = img.transpose((2, 0, 1))      # HWC > CHW
img = np.expand_dims(img, axis=0)   # CHW > BCHW

# Run inference
out = exec_net.infer(inputs={input_blob: img})

# Extract only the needed data from the output
out = out[out_blob]

# Remove unneeded dimensions
out = np.squeeze(out)

# Print the contents
print(out)

You should see a result like this:

[[ 0.          1.          1.         ...  0.18713379  0.5932617
   0.53271484]
 [-1.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]

The ... parts mean that there is actually much more data, but it has been omitted from the display.
Now let's look at the code and the result in detail.

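Explanation of the code and the result

Only three things changed from the first code in the previous article:

File names of the model and weights
File name of the input image
width x height size of the input data format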

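Inference Engine

The Inference Engine is used in exactly the same way as last time:

Load the module
Specify the target device
Load the model
Run inference

"Load the module", "Specify the target device", and "Run inference" are exactly the same.
The only difference is in "Load the model".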

# Previous article
net = IENetwork(model='intel/landmarks-regression-retail-0009/FP16/landmarks-regression-retail-0009.xml', weights='intel/landmarks-regression-retail-0009/FP16/landmarks-regression-retail-0009.bin')

# This article
net = IENetwork(model='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.xml', weights='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.bin')

The file names of the model and weights are changed from the landmark regression model to the face detection model.
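
Since only the model name differs between the two lines, both paths can be derived from that name. The following is a minimal sketch, assuming the directory layout intel/<model name>/<precision>/ that appears in the paths above. The helper name model_paths and its parameters are hypothetical and are not part of OpenVINO.

```python
def model_paths(name, precision='FP16', base_dir='intel'):
    """Return the (.xml, .bin) paths for a downloaded model.

    Assumes the layout <base_dir>/<name>/<precision>/<name>.{xml,bin},
    which is the layout used by the paths in this article.
    """
    stem = '/'.join([base_dir, name, precision, name])
    return stem + '.xml', stem + '.bin'


xml_path, bin_path = model_paths('face-detection-retail-0005')
print(xml_path)
print(bin_path)
```

The two returned strings could then be passed as model= and weights= to IENetwork, so switching models would only require changing the name.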

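Other code

The "other code" also has the same structure as before:

Get the keys of the input and output data
Check the input and output data formats
Prepare the input data
Extract the output data

"Get the keys of the input and output data" and "Extract the output data" are exactly the same.
Let's look at "Check the input and output data formats" and "Prepare the input data".

Checking the input and output data formats

The details of this model are described on this Intel site.

The Inputs section says the following: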

Inputs
name: "input" , shape: [1x3x300x300] - An input image in the format [BxCxHxW], where:
B - batch size
C - number of channels
H - image height
W - image width
Expected color order - BGR.

There is only one difference from last time: the HxW size is now 300x300.

Next, let's look at the Outputs section:

Outputs
The net outputs a blob with shape: [1, 1, N, 7], where N is the number of detected bounding boxes. For each detection, the description has the format: [image_id, label, conf, x_min, y_min, x_max, y_max], where:
image_id - ID of the image in the batch
label - predicted class ID
conf - confidence for the predicted class
(x_min, y_min) - coordinates of the top left bounding box corner
(x_max, y_max) - coordinates of the bottom right bounding box corner.

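This is quite different from last time. Roughly summarized:

There are N lists of 7 elements each
N is the number of detected bounding boxes
The 7 elements are [ image_id, label, conf, x_min, y_min, x_max, y_max ]
image_id is the ID of the image in the batch
label is the predicted class ID
conf is the confidence of the face detection
(x_min, y_min) are the coordinates of the top-left corner of the bounding box
(x_max, y_max) are the coordinates of the bottom-right corner of the bounding box

A "bounding box" is the rectangular frame that estimates the region of a face.
Now that we know the formats, we process the input and output data based on this information.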
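
To make the seven positions easier to remember, one row can be given field names. This is only a sketch: the names Detection and parse_detection are hypothetical helpers introduced here, and the numbers in the example are illustrative values, not output from the model.

```python
from collections import namedtuple

# Field order follows the Outputs description quoted above.
Detection = namedtuple(
    'Detection',
    ['image_id', 'label', 'conf', 'x_min', 'y_min', 'x_max', 'y_max'])


def parse_detection(row):
    """Convert one 7-element output row into a Detection."""
    values = [float(v) for v in row]
    if len(values) != 7:
        raise ValueError('expected 7 values, got {}'.format(len(values)))
    return Detection(*values)


# Illustrative values only (not real model output).
example = parse_detection([0, 1, 0.9, 0.1, 0.2, 0.3, 0.4])
print(example.conf, example.x_min, example.y_min)
```

With this, a value can be read as example.conf instead of by its index 2, which makes later code less error-prone.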

Preparing the input data

Only the changed parts are shown here.
First, the input image file name is changed from photo_face.jpg to photo.jpg.

# Previous article
frame = cv2.imread('image/photo_face.jpg')

# This article
frame = cv2.imread('image/photo.jpg')

Next, matching the change in Inputs, the Height and Width are changed from 48x48 to 300x300.

# Previous article
img = cv2.resize(frame, (48, 48)) # Change Height and Width

# This article
img = cv2.resize(frame, (300, 300)) # Change Height and Width

The rest of the processing is the same as last time.
The image data img is now in the required format.
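
The shape changes can be checked without a camera image or a device. The sketch below uses a zero-filled NumPy array as a stand-in for the resized image (an assumption made for illustration) and prints the shape after each step. Note that cv2.resize takes its size argument as (width, height), which makes no difference here because both are 300.

```python
import numpy as np

# Stand-in for the result of cv2.resize(frame, (300, 300)):
# height 300, width 300, 3 channels (BGR).
resized = np.zeros((300, 300, 3), dtype=np.uint8)

chw = resized.transpose((2, 0, 1))    # HWC > CHW
bchw = np.expand_dims(chw, axis=0)    # CHW > BCHW

print(resized.shape)  # (300, 300, 3)
print(chw.shape)      # (3, 300, 300)
print(bchw.shape)     # (1, 3, 300, 300)
```

The final shape matches the [1x3x300x300] input shape listed in the Inputs section.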

Extracting the output data

As before, we extract the contents of out and print them:

[[ 0.          1.          1.         ...  0.18713379  0.5932617
   0.53271484]
 [-1.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]

As it stands, the ... hides part of the result, so try printing only out[0].
Just add print(out[0]) at the end.

When you run it, you should get a list of seven elements like this:

[0.         1.         1.         0.41064453 0.18713379 0.5932617
 0.53271484]

These values have the meanings shown in the table below.

image_id	label	conf	x_min	y_min	x_max	y_max
0.0	1.0	1.0	0.41064453	0.18713379	0.5932617	0.53271484

So the bounding box has its top-left corner at (0.41064453, 0.18713379) and its bottom-right corner at (0.5932617, 0.53271484). Drawing a rectangle with these values should mark the face region.

Drawing the output on the input image

In the program, x_min is available as out[0][3], y_min as out[0][4], x_max as out[0][5], and y_max as out[0][6], so let's use them.
Copy, paste, and run the following code.

# import
import cv2
import numpy as np
from openvino.inference_engine import IENetwork, IEPlugin

# Specify the target device
plugin = IEPlugin(device='MYRIAD')

# Load the model
net = IENetwork(model='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.xml', weights='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.bin')
exec_net = plugin.load(network=net)

# Get the keys of the input and output data
input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))

# Load the image
frame = cv2.imread('image/photo.jpg')

# Convert to the input data format
img = cv2.resize(frame, (300, 300)) # Change Height and Width
img = img.transpose((2, 0, 1))      # HWC > CHW
img = np.expand_dims(img, axis=0)   # CHW > BCHW

# Run inference
out = exec_net.infer(inputs={input_blob: img})

# Extract only the needed data from the output
out = out[out_blob]

# Remove unneeded dimensions
out = np.squeeze(out)

# Print the contents
print(out)

# Convert the bounding box coordinates to the scale of the input image
xmin = int(out[0][3] * frame.shape[1])
ymin = int(out[0][4] * frame.shape[0])
xmax = int(out[0][5] * frame.shape[1])
ymax = int(out[0][6] * frame.shape[0])

# Draw the bounding box
cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), color=(89, 199, 243), thickness=3)

# Show the image
cv2.imshow('frame', frame)

# Wait until a key is pressed
cv2.waitKey(0)

# Clean up
cv2.destroyAllWindows()

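You should see an image like the following.

We were able to draw a rectangle (that is, a bounding box) on the face region!
The only additions to the first code are the following.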

# Convert the bounding box coordinates to the scale of the input image
xmin = int(out[0][3] * frame.shape[1])
ymin = int(out[0][4] * frame.shape[0])
xmax = int(out[0][5] * frame.shape[1])
ymax = int(out[0][6] * frame.shape[0])

# Draw the bounding box
cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), color=(89, 199, 243), thickness=3)

# Show the image
cv2.imshow('frame', frame)

# Wait until a key is pressed
cv2.waitKey(0)

# Clean up
cv2.destroyAllWindows()

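As in the previous article, the coordinate values are normalized, so we multiply them by frame.shape[1] (the image width) and frame.shape[0] (the image height) to convert them to the scale of the original image. cv2.rectangle then draws the rectangle using the converted coordinates.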
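
The conversion can be tried on its own with plain numbers. The sketch below uses the out[0] values shown earlier; the helper name to_pixel_box is hypothetical, and the 640x480 image size is an assumption chosen for the example, not the actual size of photo.jpg.

```python
def to_pixel_box(detection, image_width, image_height):
    """Scale normalized box corners to integer pixel coordinates."""
    xmin = int(detection[3] * image_width)
    ymin = int(detection[4] * image_height)
    xmax = int(detection[5] * image_width)
    ymax = int(detection[6] * image_height)
    return xmin, ymin, xmax, ymax


# out[0] from the run above; 640x480 is an assumed image size.
row = [0.0, 1.0, 1.0, 0.41064453, 0.18713379, 0.5932617, 0.53271484]
print(to_pixel_box(row, image_width=640, image_height=480))
# (262, 89, 379, 255)
```

In the article's code, image_width corresponds to frame.shape[1] and image_height to frame.shape[0]. int() truncates the fractional part, which is sufficient for drawing.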

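Detecting multiple faces

So far there was only one face. Now let's consider an image that contains multiple faces.
We use the image from this page.

I renamed the file to photo2.jpg.

Before drawing bounding boxes, let's check the numbers. We print the values of the first 20 rows. Copy, paste, and run the code below.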

# import
import cv2
import numpy as np
from openvino.inference_engine import IENetwork, IEPlugin

# Specify the target device
plugin = IEPlugin(device='MYRIAD')

# Load the model
net = IENetwork(model='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.xml', weights='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.bin')
exec_net = plugin.load(network=net)

# Get the keys of the input and output data
input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))

# Load the image
frame = cv2.imread('image/photo2.jpg')

# Convert to the input data format
img = cv2.resize(frame, (300, 300)) # Change Height and Width
img = img.transpose((2, 0, 1))      # HWC > CHW
img = np.expand_dims(img, axis=0)   # CHW > BCHW

# Run inference
out = exec_net.infer(inputs={input_blob: img})

# Extract only the needed data from the output
out = out[out_blob]

# Remove unneeded dimensions
out = np.squeeze(out)

# Print only the first 20 rows
for i in range(20):
    print(out[i])

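The only changes are the input image file name and the final output.
You should see something like the following. (Some line breaks have been removed for readability.)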

[0.         1.         0.9995117  0.61328125 0.21655273 0.74121094 0.4572754]
[0.         1.         0.99902344 0.29516602 0.23730469 0.43823242 0.4716797]
[0.         1.         0.12109375 0.4790039  0.3203125  0.5258789 0.3955078 ]
[0.         1.         0.04101562 0.12561035 0.27612305 0.14807129 0.3112793]
[0.         1.         0.04052734 0.49072266 0.26611328 0.5151367 0.3046875 ]
[0.         1.         0.03857422 0.9555664  0.2866211  0.9995117 0.3857422 ]
[0.         1.         0.03466797 0.58154297 0.26953125 0.6176758 0.3359375 ]
[0.         1.         0.03222656 0.58203125 0.31152344 0.6230469 0.3881836 ]
[0.         1.         0.02929688 0.7050781  0.6694336  0.81640625 0.7817383]
[0.         1.         0.02490234 0.49145508 0.2310791  0.5161133 0.27124023]
[0.         1.         0.02392578 0.5371094  0.27807617 0.5595703 0.31518555]
[0.         1.         0.02197266 0.03356934 0.23742676 0.0614624 0.2770996 ]
[0.         1.         0.02099609 0.5854492  0.36791992 0.62646484 0.44311523]
[0.         1.         0.02001953 0.5317383  0.31591797 0.56689453 0.37597656]
[-1.  0.  0.  0.  0.  0.  0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0.]

[ image_id, label, conf, x_min, y_min, x_max, y_max ]

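Columns are counted starting from column 0.
The first thing this result suggests is that faces were detected in 14 places.
The reason is that, looking at the bounding box coordinates in columns 3 to 6 (x_min y_min x_max y_max), there are 14 rows in which they are not all 0.
Looking at image_id in column 0, it is normally 0, but it becomes -1 on the row where detections end.
Looking at label in column 1, it is 1. on rows with a detection and 0. otherwise.
Looking at conf in column 2, the rows are sorted in descending order (largest value first).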
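
These observations can be turned into a small counting routine. The sketch below assumes, as the output above suggests, that the first row with image_id of -1 marks the end of the valid detections. The function name valid_detections is hypothetical, and the rows are a shortened copy of the output above rather than the full 20 rows.

```python
def valid_detections(rows):
    """Return the rows that come before the first image_id of -1."""
    valid = []
    for row in rows:
        if row[0] < 0:  # image_id == -1: assumed end-of-detections marker
            break
        valid.append(row)
    return valid


# Shortened copy of the printed output (3 detections, then padding).
rows = [
    [0.0, 1.0, 0.9995117, 0.61328125, 0.21655273, 0.74121094, 0.4572754],
    [0.0, 1.0, 0.99902344, 0.29516602, 0.23730469, 0.43823242, 0.4716797],
    [0.0, 1.0, 0.12109375, 0.4790039, 0.3203125, 0.5258789, 0.3955078],
    [-1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(len(valid_detections(rows)))  # 3
```

Applied to the full output above, the same loop would stop at the 15th row and return the 14 rows counted by hand.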

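We want to detect two faces in this image, but if we draw everything as it is, bounding boxes will appear in unwanted places too.

Let's look at conf in column 2 again. conf is the confidence of the face detection, and the closer it is to 1.0, the more reliable the detection. The first two rows are 0.999, almost 1.0, while the next row is only 0.121. So it looks like we should draw a bounding box only when conf is above a specified value.
This time, let's use 0.5. A value like this is called a threshold.
The code below reflects all of this. Copy, paste, and run it.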

# import
import cv2
import numpy as np
from openvino.inference_engine import IENetwork, IEPlugin

# Specify the target device
plugin = IEPlugin(device='MYRIAD')

# Load the model
net = IENetwork(model='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.xml', weights='intel/face-detection-retail-0005/FP16/face-detection-retail-0005.bin')
exec_net = plugin.load(network=net)

# Get the keys of the input and output data
input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))

# Load the image
frame = cv2.imread('image/photo2.jpg')

# Convert to the input data format
img = cv2.resize(frame, (300, 300)) # Change Height and Width
img = img.transpose((2, 0, 1))      # HWC > CHW
img = np.expand_dims(img, axis=0)   # CHW > BCHW

# Run inference
out = exec_net.infer(inputs={input_blob: img})

# Extract only the needed data from the output
out = out[out_blob]

# Remove unneeded dimensions
out = np.squeeze(out)

# Process each detected face region one at a time
for detection in out:
    # Get the conf value
    confidence = float(detection[2])

    # Convert the bounding box coordinates to the scale of the input image
    xmin = int(detection[3] * frame.shape[1])
    ymin = int(detection[4] * frame.shape[0])
    xmax = int(detection[5] * frame.shape[1])
    ymax = int(detection[6] * frame.shape[0])

    # Draw the bounding box only when conf is greater than 0.5
    if confidence > 0.5:
        # Draw the bounding box
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), color=(89, 199, 243), thickness=3)

# Show the image
cv2.imshow('frame', frame)

# Wait until a key is pressed
cv2.waitKey(0)

# Clean up
cv2.destroyAllWindows()
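
To see how the choice of threshold affects the result without running the device, the same comparison can be tried on the printed numbers. This is a minimal sketch: the function name filter_by_confidence is hypothetical, and the rows are the first four rows of the 20-row output shown earlier.

```python
def filter_by_confidence(rows, threshold=0.5):
    """Keep only rows whose conf (index 2) is greater than threshold."""
    return [row for row in rows if float(row[2]) > threshold]


# First four rows of the 20-row output shown earlier.
rows = [
    [0.0, 1.0, 0.9995117, 0.61328125, 0.21655273, 0.74121094, 0.4572754],
    [0.0, 1.0, 0.99902344, 0.29516602, 0.23730469, 0.43823242, 0.4716797],
    [0.0, 1.0, 0.12109375, 0.4790039, 0.3203125, 0.5258789, 0.3955078],
    [0.0, 1.0, 0.04101562, 0.12561035, 0.27612305, 0.14807129, 0.3112793],
]
print(len(filter_by_confidence(rows)))                  # 2
print(len(filter_by_confidence(rows, threshold=0.1)))   # 3
```

With the threshold of 0.5 only the two faces remain, while lowering it to 0.1 lets the 0.121 row through, which is why the threshold needs to be chosen with the data in mind.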