
Using TFRecord File




  • In the previous post, we looked at how to create the TFRecord file format.

  • In this post, we will run image classification with the TFRecord dataset we created last time.

  • TFRecord shows its best performance when used with TensorFlow, so wherever possible the code here is written with the functions that TensorFlow provides.







0. Prepare

  • Load the required modules.
import tensorflow as tf
from tqdm import tqdm
from sklearn.model_selection import train_test_split




  • Define the batch size and the shuffle buffer size in advance.
class CFG:
    BATCH_SIZE = 32
    BUFFER_SIZE = 500




  • Collect the full paths of the TFRecord files up front; we will use them later to build the dataset that the map function is applied to.
Cat_Fearue_File_List = tf.io.gfile.listdir("./PetImages/Cat/TFRecord")
Cat_Fearue_File_List = list(map(lambda x:"./PetImages/Cat/TFRecord/" + x, Cat_Fearue_File_List))

print(len(Cat_Fearue_File_List))
Cat_Fearue_File_List[:10]
12427

['./PetImages/Cat/TFRecord/0.tfrecord',
 './PetImages/Cat/TFRecord/1.tfrecord',
 './PetImages/Cat/TFRecord/10.tfrecord',
 './PetImages/Cat/TFRecord/100.tfrecord',
 './PetImages/Cat/TFRecord/1000.tfrecord',
 './PetImages/Cat/TFRecord/10000.tfrecord',
 './PetImages/Cat/TFRecord/10001.tfrecord',
 './PetImages/Cat/TFRecord/10002.tfrecord',
 './PetImages/Cat/TFRecord/10003.tfrecord',
 './PetImages/Cat/TFRecord/10004.tfrecord']




Dog_Fearue_File_List = tf.io.gfile.listdir("./PetImages/Dog/TFRecord")
Dog_Fearue_File_List = list(map(lambda x:"./PetImages/Dog/TFRecord/" + x, Dog_Fearue_File_List))

print(len(Dog_Fearue_File_List))
Dog_Fearue_File_List[:10]
12397

['./PetImages/Dog/TFRecord/0.tfrecord',
 './PetImages/Dog/TFRecord/1.tfrecord',
 './PetImages/Dog/TFRecord/10.tfrecord',
 './PetImages/Dog/TFRecord/100.tfrecord',
 './PetImages/Dog/TFRecord/1000.tfrecord',
 './PetImages/Dog/TFRecord/10000.tfrecord',
 './PetImages/Dog/TFRecord/10001.tfrecord',
 './PetImages/Dog/TFRecord/10002.tfrecord',
 './PetImages/Dog/TFRecord/10003.tfrecord',
 './PetImages/Dog/TFRecord/10004.tfrecord']








1. Train & Validation Set Split

  • Let's split the data into the Train / Val. sets to be used during training.

  • We split Cat / Dog each with the same 8:2 ratio.



Cat_Train_File_List, Cat_Val_File_List = train_test_split(Cat_Fearue_File_List, test_size=0.2, random_state=123)
print("Cat Train : ",len(Cat_Train_File_List) , "Cat Val. : ",len(Cat_Val_File_List))
Cat Train :  9941 Cat Val. :  2486




Dog_Train_File_List, Dog_Val_File_List = train_test_split(Dog_Fearue_File_List, test_size=0.2, random_state=123)
print("Dog Train : ",len(Dog_Train_File_List) , "Dog Val. : ",len(Dog_Val_File_List))
Dog Train :  9917 Dog Val. :  2480




  • Combine the Cat / Dog train file lists and the val. file lists.
Train_Feature_File_List = Cat_Train_File_List + Dog_Train_File_List
Val_Feature_File_List = Cat_Val_File_List + Dog_Val_File_List
print(len(Train_Feature_File_List) , len(Val_Feature_File_List) )
19858 4966


  • Shuffle them well.
Train_Feature_File_List = tf.random.shuffle(Train_Feature_File_List)
Train_Feature_File_List


<tf.Tensor: shape=(19858,), dtype=string, numpy=
array([b'./PetImages/Dog/TFRecord/10425.tfrecord',
       b'./PetImages/Cat/TFRecord/3119.tfrecord',
       b'./PetImages/Cat/TFRecord/5364.tfrecord', ...,
       b'./PetImages/Dog/TFRecord/9267.tfrecord',
       b'./PetImages/Dog/TFRecord/12042.tfrecord',
       b'./PetImages/Dog/TFRecord/551.tfrecord'], dtype=object)>
Val_Feature_File_List = tf.random.shuffle(Val_Feature_File_List)
Val_Feature_File_List
<tf.Tensor: shape=(4966,), dtype=string, numpy=
array([b'./PetImages/Dog/TFRecord/4587.tfrecord',
       b'./PetImages/Cat/TFRecord/387.tfrecord',
       b'./PetImages/Cat/TFRecord/5345.tfrecord', ...,
       b'./PetImages/Dog/TFRecord/2247.tfrecord',
       b'./PetImages/Dog/TFRecord/12447.tfrecord',
       b'./PetImages/Cat/TFRecord/9419.tfrecord'], dtype=object)>








2. Making Dataset

  • Now let's read the TFRecord files and build the datasets.

  • The overall flow is:
    • Create a dataset from the full paths of the TFRecord files
    • Write / apply a map function that reads each TFRecord file and decodes its contents
    • Apply shuffle / batch / prefetch to the dataset
    • Use it for training
  • Let's go through these one by one.
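

  • Condensed into a few lines, the pipeline we are about to build looks roughly like the sketch below. This is an overview only; map_fn and every step are explained in the following sub-sections.
# Overview sketch: file paths -> TFRecordDataset -> decode -> shuffle / batch / prefetch
dataset = tf.data.TFRecordDataset(Train_Feature_File_List)          # read raw serialized records
dataset = dataset.map(map_fn)                                        # decode each record (map_fn is defined in 2.1.)
dataset = dataset.shuffle(CFG.BUFFER_SIZE).batch(CFG.BATCH_SIZE)     # shuffle and batch
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)            # overlap the input pipeline with training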




2.1. Map Function

  • This is the function that decodes the contents of a TFRecord file.


  • def map_fn(serialized_example):
    • The serialized_example parameter is not the file path itself; it is one raw serialized record (a serialized tf.train.Example) read from a TFRecord file.
    • When we create the dataset with TFRecordDataset below, the records stored in the files we pass in are read and handed to this function one at a time as serialized_example.


  • feature = { 'Feature': tf.io.FixedLenFeature([49*1280], tf.float32), 'Label': tf.io.FixedLenFeature([1], tf.int64) }

    • This part defines the structure of the TFRecord file we are going to read.
    • It means that the data named 'Feature' is a float of length 62720 (49*1280), and the data named 'Label' is an int of length 1.
    • If you think about it, when we have to read a TFRecord file that somebody else wrote, we cannot decode it without knowing this structure.
    • In other words, whenever a dataset is distributed in TFRecord format, this structure information must be provided along with it (see the inspection sketch right after the map_fn code below).


  • example = tf.io.parse_single_example(serialized_example, feature)
    • This is where the serialized record is actually decoded according to the structure defined above.


  • example['Feature'] = tf.reshape( example['Feature'] , (7,7,1280) )
    • Because we flattened the shape when writing the TFRecord file, we restore it to its original shape here.


  • tf.squeeze( tf.one_hot(example['Label'] , depth=2) )
    • The label is 0 for a Cat and 1 for a Dog.
    • This converts it to one-hot format.



def map_fn(serialized_example):

    feature = {
        'Feature': tf.io.FixedLenFeature([49*1280], tf.float32),
        'Label': tf.io.FixedLenFeature([1], tf.int64)
    }
    
    example = tf.io.parse_single_example(serialized_example, feature)
    
    example['Feature'] = tf.reshape( example['Feature'] , (7,7,1280) )
    
    return example['Feature'], tf.squeeze( tf.one_hot(example['Label'] , depth=2) )
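

  • By the way, if you ever have to read a TFRecord file whose structure you do not know, you can at least peek at which feature keys it contains. Below is a minimal sketch (not part of the training pipeline), reusing the Cat_Fearue_File_List defined earlier; tf.train.Example.FromString is the standard protobuf way to parse one raw record.
# Minimal sketch: inspect which feature keys an unknown TFRecord file contains.
# Any .tfrecord path works here; we reuse the first cat file from above.
raw_dataset = tf.data.TFRecordDataset(Cat_Fearue_File_List[:1])

for raw_record in raw_dataset.take(1):
    # Parse the raw bytes into a tf.train.Example protobuf and list its feature keys.
    example_proto = tf.train.Example.FromString(raw_record.numpy())
    print(list(example_proto.features.feature.keys()))   # expected: ['Feature', 'Label']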




2.2. Define Dataset

  • Since we already have the lists of TFRecord files, we will use tf.data.TFRecordDataset to create the datasets.
Train_Dataset = tf.data.TFRecordDataset( Train_Feature_File_List )
Val_Dataset = tf.data.TFRecordDataset( Val_Feature_File_List )
Train_Dataset
<TFRecordDatasetV2 shapes: (), types: tf.string>
Val_Dataset
<TFRecordDatasetV2 shapes: (), types: tf.string>




  • Now that the datasets and the map function are ready, we apply the map function and then shuffle / batch / prefetch.
Train_Dataset = Train_Dataset.map(map_fn , 
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)

Train_Dataset = Train_Dataset.shuffle(CFG.BUFFER_SIZE).batch(CFG.BATCH_SIZE)
Train_Dataset = Train_Dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
Val_Dataset = Val_Dataset.map(map_fn , 
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)

Val_Dataset = Val_Dataset.shuffle(CFG.BUFFER_SIZE).batch(CFG.BATCH_SIZE)
Val_Dataset = Val_Dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)




  • Shall we take a look at one batch to check that it is read correctly?
for batch in Train_Dataset.take(1):
    batch
print(batch)
batch[1].numpy()
(<tf.Tensor: shape=(32, 7, 7, 1280), dtype=float32, numpy=
array([[[[-1.30283579e-01, -1.79280803e-01, -2.73918390e-01, ...,
          -2.35235289e-01, -2.51564652e-01, -1.24651089e-01],
         [-2.23974511e-01, -2.69972891e-01, -2.78214544e-01, ...,
          -1.61265686e-01, -2.63588488e-01,  4.82403219e-01],
         [-9.15986970e-02, -7.91421384e-02, -2.09058121e-01, ...,
          -2.46901169e-01, -2.35181734e-01, -2.75620580e-01],
         ...,
         [-4.28934097e-02, -3.80189456e-02, -9.37430859e-02, ...,
          -2.70000398e-01, -2.56699502e-01,  3.71812642e-01],
         [-3.37041589e-03,  2.37213349e+00, -2.73799729e-02, ...,
          -2.78330237e-01, -2.63371170e-01, -2.72408187e-01],
         [-2.69764900e-01, -9.48898420e-02, -6.93319291e-02, ...,
          -2.59258837e-01, -2.78016418e-01, -2.23277569e-01]],

        [[-2.40731761e-01, -1.49252594e-01, -1.29014462e-01, ...,
          -3.48842107e-02, -2.78116941e-01,  1.21828353e+00],
         [-2.78438926e-01,  5.36722839e-01, -1.36655658e-01, ...,
          -6.23840746e-03, -2.26522133e-01,  1.85954738e+00],
         [-6.25507459e-02, -1.00828111e-01, -1.55045748e-01, ...,
          -4.72838171e-02, -1.98967606e-01, -2.27459908e-01],
         ...,
         [-4.31086839e-04,  2.16324544e+00, -1.06345741e-02, ...,
          -3.63434367e-02, -3.13317887e-02,  9.24772203e-01],
         [-1.53233509e-06,  1.35951967e+01, -7.76352419e-04, ...,
          -4.41170437e-03, -2.31151477e-01, -2.63853192e-01],
         [-6.46062661e-03,  4.92377949e+00, -1.39266048e-02, ...,
          -5.87231778e-02,  1.09022892e+00, -2.71575540e-01]],

        [[-1.42593503e-01,  3.51380378e-01, -2.29087830e-01, ...,
          -9.77664907e-03, -2.43833497e-01,  6.59770250e-01],
         [-6.03562184e-02,  2.67279983e+00, -8.22568163e-02, ...,
          -3.87481693e-03, -3.45014818e-02,  2.45907426e+00],
         [-6.38377480e-03,  4.15273046e+00, -1.50325801e-02, ...,
          -8.94669294e-02, -1.27345668e-02, -2.78422236e-01],
         ...,
         [-3.85519373e-03, -1.10767812e-01, -3.39450780e-03, ...,
          -1.03545524e-02, -8.26057419e-03,  1.70724523e+00],
         [-5.78504521e-04,  6.60518837e+00, -2.84841983e-04, ...,
          -3.46982223e-03, -2.71046907e-02, -9.67810899e-02],
         [-5.40567897e-02, -2.77622372e-01, -3.78229097e-02, ...,
          -6.76211938e-02, -2.77685404e-01, -2.49596104e-01]],

        ...,


        [[ 2.25460219e+00, -2.29545057e-01,  7.19489276e-01, ...,
          -1.46996761e-02, -2.56974429e-01,  2.10201070e-01],
         [ 9.66896489e-02, -2.15796664e-01, -2.50947535e-01, ...,
          -2.59889185e-01, -2.60895252e-01,  1.05790734e-01],
         [-2.78464437e-01, -2.75186807e-01, -2.66715109e-01, ...,
          -2.64783144e-01, -1.05178565e-01,  4.61720973e-01],
         ...,
         [ 1.15485799e+00, -1.55356184e-01, -1.62961870e-01, ...,
          -7.94149265e-02, -2.78311014e-01,  7.28313923e-01],
         [ 2.17598176e+00, -8.69607255e-02,  2.29728460e-01, ...,
          -1.48992896e-01,  2.41546303e-01,  4.84176666e-01],
         [ 1.63772357e+00, -4.06442173e-02,  1.13258696e+00, ...,
          -2.36238405e-01, -2.71336257e-01,  6.56960234e-02]]]],
      dtype=float32)>, <tf.Tensor: shape=(32, 2), dtype=float32, numpy=
array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]], dtype=float32)>)





array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]], dtype=float32)



  • Notice that even take(1) returns a full batch's worth of data, because batch() was applied before take().

  • Looking at the contents, it seems to be read correctly.
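

  • If you just want to confirm the batched shapes without pulling actual data, element_spec (a standard tf.data.Dataset attribute) shows them. A minimal sketch:
# Shapes after map / batch; the leading None is the batch dimension.
print(Train_Dataset.element_spec)
# Expected to show roughly:
# (TensorSpec(shape=(None, 7, 7, 1280), dtype=tf.float32, name=None),
#  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))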








3. Model Define

  • Now that the data is ready, let's define a simple dense network to classify the features extracted by EfficientNet.

  • Pay attention to the input shape and build the layers as you see fit.

model = tf.keras.Sequential()

model.add( tf.keras.layers.InputLayer(input_shape=(7,7,1280)) )
model.add( tf.keras.layers.GlobalAveragePooling2D() )

model.add( tf.keras.layers.BatchNormalization() )
model.add( tf.keras.layers.Dropout(0.25) )
model.add( tf.keras.layers.Dense(256 , activation='relu') )

model.add( tf.keras.layers.BatchNormalization() )
model.add( tf.keras.layers.Dropout(0.25) )
model.add( tf.keras.layers.Dense(64 , activation='relu') )

model.add( tf.keras.layers.BatchNormalization() )
model.add( tf.keras.layers.Dropout(0.25) )
model.add( tf.keras.layers.Dense(2 , activation='softmax') )
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
global_average_pooling2d (Gl (None, 1280)              0         
_________________________________________________________________
batch_normalization (BatchNo (None, 1280)              5120      
_________________________________________________________________
dropout (Dropout)            (None, 1280)              0         
_________________________________________________________________
dense (Dense)                (None, 256)               327936    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64)                256       
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 130       
=================================================================
Total params: 350,914
Trainable params: 347,714
Non-trainable params: 3,200
_________________________________________________________________




  • Choose and configure an optimizer, compile the model, and start training.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.compile(optimizer=optimizer, 
              loss="categorical_crossentropy", metrics=["accuracy"]
             )
EPOCHS = 5

hist = model.fit(Train_Dataset, 
                 epochs=EPOCHS, 
                 validation_data=Val_Dataset, 
                 verbose=1)
Epoch 1/5
621/621 [==============================] - 285s 458ms/step - loss: 0.0579 - accuracy: 0.9812 - val_loss: 0.0319 - val_accuracy: 0.9887
Epoch 2/5
621/621 [==============================] - 362s 582ms/step - loss: 0.0383 - accuracy: 0.9867 - val_loss: 0.0371 - val_accuracy: 0.9881
Epoch 3/5
621/621 [==============================] - 384s 618ms/step - loss: 0.0339 - accuracy: 0.9892 - val_loss: 0.0327 - val_accuracy: 0.9895
Epoch 4/5
621/621 [==============================] - 473s 761ms/step - loss: 0.0355 - accuracy: 0.9880 - val_loss: 0.0291 - val_accuracy: 0.9899
Epoch 5/5
621/621 [==============================] - 436s 702ms/step - loss: 0.0331 - accuracy: 0.9902 - val_loss: 0.0357 - val_accuracy: 0.9895




  • The model already shows high accuracy on both the Train and Val. sets from the very first epoch.

  • It looks like everything is working properly.




import matplotlib.pyplot as plt
plt.plot(hist.history["accuracy"])
plt.plot(hist.history["val_accuracy"])
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()








4. Summary

  • In this post, we went over how to read the TFRecord file format and use it for actual training.

  • The important point is that in order to use a TFRecord file, you must know the structure (the features) that was used when the dataset was created.