
Using TFRecord File




  • In the previous post, we looked at how to create the TFRecord file format.

  • In this post, we will run image classification with the TFRecord dataset we created last time.

  • TFRecord shows its best performance when used with TensorFlow, so wherever possible the code here is written with the functions that TensorFlow provides.







0. Prepare

  • Load the required modules.
import tensorflow as tf
from tqdm import tqdm
from sklearn.model_selection import train_test_split




  • Define the batch size and the shuffle buffer size in advance.
class CFG:
    BATCH_SIZE = 32
    BUFFER_SIZE = 500




  • Collect the full paths of the TFRecord files up front; we will use them later to build the dataset that the map function is applied to.
Cat_Fearue_File_List = tf.io.gfile.listdir("./PetImages/Cat/TFRecord")
Cat_Fearue_File_List = list(map(lambda x:"./PetImages/Cat/TFRecord/" + x, Cat_Fearue_File_List))

print(len(Cat_Fearue_File_List))
Cat_Fearue_File_List[:10]
12427

['./PetImages/Cat/TFRecord/0.tfrecord',
 './PetImages/Cat/TFRecord/1.tfrecord',
 './PetImages/Cat/TFRecord/10.tfrecord',
 './PetImages/Cat/TFRecord/100.tfrecord',
 './PetImages/Cat/TFRecord/1000.tfrecord',
 './PetImages/Cat/TFRecord/10000.tfrecord',
 './PetImages/Cat/TFRecord/10001.tfrecord',
 './PetImages/Cat/TFRecord/10002.tfrecord',
 './PetImages/Cat/TFRecord/10003.tfrecord',
 './PetImages/Cat/TFRecord/10004.tfrecord']




Dog_Fearue_File_List = tf.io.gfile.listdir("./PetImages/Dog/TFRecord")
Dog_Fearue_File_List = list(map(lambda x:"./PetImages/Dog/TFRecord/" + x, Dog_Fearue_File_List))

print(len(Dog_Fearue_File_List))
Dog_Fearue_File_List[:10]
12397

['./PetImages/Dog/TFRecord/0.tfrecord',
 './PetImages/Dog/TFRecord/1.tfrecord',
 './PetImages/Dog/TFRecord/10.tfrecord',
 './PetImages/Dog/TFRecord/100.tfrecord',
 './PetImages/Dog/TFRecord/1000.tfrecord',
 './PetImages/Dog/TFRecord/10000.tfrecord',
 './PetImages/Dog/TFRecord/10001.tfrecord',
 './PetImages/Dog/TFRecord/10002.tfrecord',
 './PetImages/Dog/TFRecord/10003.tfrecord',
 './PetImages/Dog/TFRecord/10004.tfrecord']








1. Train & Validation Set Split

  • Let's split the data into the Train / Val. sets to be used during training.

  • We split Cat / Dog each with the same 8:2 ratio.



Cat_Train_File_List, Cat_Val_File_List = train_test_split(Cat_Fearue_File_List, test_size=0.2, random_state=123)
print("Cat Train : ",len(Cat_Train_File_List) , "Cat Val. : ",len(Cat_Val_File_List))
Cat Train :  9941 Cat Val. :  2486




Dog_Train_File_List, Dog_Val_File_List = train_test_split(Dog_Fearue_File_List, test_size=0.2, random_state=123)
print("Dog Train : ",len(Dog_Train_File_List) , "Dog Val. : ",len(Dog_Val_File_List))
Dog Train :  9917 Dog Val. :  2480




  • Combine the Cat / Dog train file lists and the val. file lists.
Train_Feature_File_List = Cat_Train_File_List + Dog_Train_File_List
Val_Feature_File_List = Cat_Val_File_List + Dog_Val_File_List
print(len(Train_Feature_File_List) , len(Val_Feature_File_List) )
19858 4966


  • Shuffle them well.
Train_Feature_File_List = tf.random.shuffle(Train_Feature_File_List)
Train_Feature_File_List


<tf.Tensor: shape=(19858,), dtype=string, numpy=
array([b'./PetImages/Dog/TFRecord/10425.tfrecord',
       b'./PetImages/Cat/TFRecord/3119.tfrecord',
       b'./PetImages/Cat/TFRecord/5364.tfrecord', ...,
       b'./PetImages/Dog/TFRecord/9267.tfrecord',
       b'./PetImages/Dog/TFRecord/12042.tfrecord',
       b'./PetImages/Dog/TFRecord/551.tfrecord'], dtype=object)>
Val_Feature_File_List = tf.random.shuffle(Val_Feature_File_List)
Val_Feature_File_List
<tf.Tensor: shape=(4966,), dtype=string, numpy=
array([b'./PetImages/Dog/TFRecord/4587.tfrecord',
       b'./PetImages/Cat/TFRecord/387.tfrecord',
       b'./PetImages/Cat/TFRecord/5345.tfrecord', ...,
       b'./PetImages/Dog/TFRecord/2247.tfrecord',
       b'./PetImages/Dog/TFRecord/12447.tfrecord',
       b'./PetImages/Cat/TFRecord/9419.tfrecord'], dtype=object)>








2. Making Dataset

  • Now let's read the TFRecord files and build the datasets.

  • The overall flow is:
    • Create a dataset from the full paths of the TFRecord files
    • Write / apply a map function that reads each TFRecord file and decodes its contents
    • Apply shuffle / batch / prefetch to the dataset
    • Use it for training
  • Let's go through these one by one.
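

  • Condensed into a few lines, the pipeline we are about to build looks roughly like the sketch below. This is an overview only; map_fn and every step are explained in the following sub-sections.
# Overview sketch: file paths -> TFRecordDataset -> decode -> shuffle / batch / prefetch
dataset = tf.data.TFRecordDataset(Train_Feature_File_List)          # read raw serialized records
dataset = dataset.map(map_fn)                                        # decode each record (map_fn is defined in 2.1.)
dataset = dataset.shuffle(CFG.BUFFER_SIZE).batch(CFG.BATCH_SIZE)     # shuffle and batch
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)            # overlap the input pipeline with training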




2.1. Map Function

  • This is the function that decodes the contents of a TFRecord file.


  • def map_fn(serialized_example):
    • The serialized_example parameter is not the file path itself; it is one raw serialized record (a serialized tf.train.Example) read from a TFRecord file.
    • When we create the dataset with TFRecordDataset below, the records stored in the files we pass in are read and handed to this function one at a time as serialized_example.


  • feature = { 'Feature': tf.io.FixedLenFeature([49*1280], tf.float32), 'Label': tf.io.FixedLenFeature([1], tf.int64) }

    • This part defines the structure of the TFRecord file we are going to read.
    • It means that the data named 'Feature' is a float of length 62720 (49*1280), and the data named 'Label' is an int of length 1.
    • If you think about it, when we have to read a TFRecord file that somebody else wrote, we cannot decode it without knowing this structure.
    • In other words, whenever a dataset is distributed in TFRecord format, this structure information must be provided along with it (see the inspection sketch right after the map_fn code below).


  • example = tf.io.parse_single_example(serialized_example, feature)
    • This is where the serialized record is actually decoded according to the structure defined above.


  • example['Feature'] = tf.reshape( example['Feature'] , (7,7,1280) )
    • Because we flattened the shape when writing the TFRecord file, we restore it to its original shape here.


  • tf.squeeze( tf.one_hot(example['Label'] , depth=2) )
    • The label is 0 for a Cat and 1 for a Dog.
    • This converts it to one-hot format.



def map_fn(serialized_example):

    feature = {
        'Feature': tf.io.FixedLenFeature([49*1280], tf.float32),
        'Label': tf.io.FixedLenFeature([1], tf.int64)
    }
    
    example = tf.io.parse_single_example(serialized_example, feature)
    
    example['Feature'] = tf.reshape( example['Feature'] , (7,7,1280) )
    
    return example['Feature'], tf.squeeze( tf.one_hot(example['Label'] , depth=2) )
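

  • By the way, if you ever have to read a TFRecord file whose structure you do not know, you can at least peek at which feature keys it contains. Below is a minimal sketch (not part of the training pipeline), reusing the Cat_Fearue_File_List defined earlier; tf.train.Example.FromString is the standard protobuf way to parse one raw record.
# Minimal sketch: inspect which feature keys an unknown TFRecord file contains.
# Any .tfrecord path works here; we reuse the first cat file from above.
raw_dataset = tf.data.TFRecordDataset(Cat_Fearue_File_List[:1])

for raw_record in raw_dataset.take(1):
    # Parse the raw bytes into a tf.train.Example protobuf and list its feature keys.
    example_proto = tf.train.Example.FromString(raw_record.numpy())
    print(list(example_proto.features.feature.keys()))   # expected: ['Feature', 'Label']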




2.2. Define Dataset

  • Since we already have the lists of TFRecord files, we will use tf.data.TFRecordDataset to create the datasets.
Train_Dataset = tf.data.TFRecordDataset( Train_Feature_File_List )
Val_Dataset = tf.data.TFRecordDataset( Val_Feature_File_List )
Train_Dataset
<TFRecordDatasetV2 shapes: (), types: tf.string>
Val_Dataset
<TFRecordDatasetV2 shapes: (), types: tf.string>




  • Now that the datasets and the map function are ready, we apply the map function and then shuffle / batch / prefetch.
Train_Dataset = Train_Dataset.map(map_fn , 
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)

Train_Dataset = Train_Dataset.shuffle(CFG.BUFFER_SIZE).batch(CFG.BATCH_SIZE)
Train_Dataset = Train_Dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
Val_Dataset = Val_Dataset.map(map_fn , 
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)

Val_Dataset = Val_Dataset.shuffle(CFG.BUFFER_SIZE).batch(CFG.BATCH_SIZE)
Val_Dataset = Val_Dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)




  • Shall we take a look at one batch to check that it is read correctly?
for batch in Train_Dataset.take(1):
    batch
print(batch)
batch[1].numpy()
(<tf.Tensor: shape=(32, 7, 7, 1280), dtype=float32, numpy=
array([[[[-1.30283579e-01, -1.79280803e-01, -2.73918390e-01, ...,
          -2.35235289e-01, -2.51564652e-01, -1.24651089e-01],
         [-2.23974511e-01, -2.69972891e-01, -2.78214544e-01, ...,
          -1.61265686e-01, -2.63588488e-01,  4.82403219e-01],
         [-9.15986970e-02, -7.91421384e-02, -2.09058121e-01, ...,
          -2.46901169e-01, -2.35181734e-01, -2.75620580e-01],
         ...,
         [-4.28934097e-02, -3.80189456e-02, -9.37430859e-02, ...,
          -2.70000398e-01, -2.56699502e-01,  3.71812642e-01],
         [-3.37041589e-03,  2.37213349e+00, -2.73799729e-02, ...,
          -2.78330237e-01, -2.63371170e-01, -2.72408187e-01],
         [-2.69764900e-01, -9.48898420e-02, -6.93319291e-02, ...,
          -2.59258837e-01, -2.78016418e-01, -2.23277569e-01]],

        [[-2.40731761e-01, -1.49252594e-01, -1.29014462e-01, ...,
          -3.48842107e-02, -2.78116941e-01,  1.21828353e+00],
         [-2.78438926e-01,  5.36722839e-01, -1.36655658e-01, ...,
          -6.23840746e-03, -2.26522133e-01,  1.85954738e+00],
         [-6.25507459e-02, -1.00828111e-01, -1.55045748e-01, ...,
          -4.72838171e-02, -1.98967606e-01, -2.27459908e-01],
         ...,
         [-4.31086839e-04,  2.16324544e+00, -1.06345741e-02, ...,
          -3.63434367e-02, -3.13317887e-02,  9.24772203e-01],
         [-1.53233509e-06,  1.35951967e+01, -7.76352419e-04, ...,
          -4.41170437e-03, -2.31151477e-01, -2.63853192e-01],
         [-6.46062661e-03,  4.92377949e+00, -1.39266048e-02, ...,
          -5.87231778e-02,  1.09022892e+00, -2.71575540e-01]],

        [[-1.42593503e-01,  3.51380378e-01, -2.29087830e-01, ...,
          -9.77664907e-03, -2.43833497e-01,  6.59770250e-01],
         [-6.03562184e-02,  2.67279983e+00, -8.22568163e-02, ...,
          -3.87481693e-03, -3.45014818e-02,  2.45907426e+00],
         [-6.38377480e-03,  4.15273046e+00, -1.50325801e-02, ...,
          -8.94669294e-02, -1.27345668e-02, -2.78422236e-01],
         ...,
         [-3.85519373e-03, -1.10767812e-01, -3.39450780e-03, ...,
          -1.03545524e-02, -8.26057419e-03,  1.70724523e+00],
         [-5.78504521e-04,  6.60518837e+00, -2.84841983e-04, ...,
          -3.46982223e-03, -2.71046907e-02, -9.67810899e-02],
         [-5.40567897e-02, -2.77622372e-01, -3.78229097e-02, ...,
          -6.76211938e-02, -2.77685404e-01, -2.49596104e-01]],

        ...,


        [[ 2.25460219e+00, -2.29545057e-01,  7.19489276e-01, ...,
          -1.46996761e-02, -2.56974429e-01,  2.10201070e-01],
         [ 9.66896489e-02, -2.15796664e-01, -2.50947535e-01, ...,
          -2.59889185e-01, -2.60895252e-01,  1.05790734e-01],
         [-2.78464437e-01, -2.75186807e-01, -2.66715109e-01, ...,
          -2.64783144e-01, -1.05178565e-01,  4.61720973e-01],
         ...,
         [ 1.15485799e+00, -1.55356184e-01, -1.62961870e-01, ...,
          -7.94149265e-02, -2.78311014e-01,  7.28313923e-01],
         [ 2.17598176e+00, -8.69607255e-02,  2.29728460e-01, ...,
          -1.48992896e-01,  2.41546303e-01,  4.84176666e-01],
         [ 1.63772357e+00, -4.06442173e-02,  1.13258696e+00, ...,
          -2.36238405e-01, -2.71336257e-01,  6.56960234e-02]]]],
      dtype=float32)>, <tf.Tensor: shape=(32, 2), dtype=float32, numpy=
array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]], dtype=float32)>)





array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]], dtype=float32)



  • Notice that even take(1) returns a full batch's worth of data, because batch() was applied before take().

  • Looking at the contents, it seems to be read correctly.
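

  • If you just want to confirm the batched shapes without pulling actual data, element_spec (a standard tf.data.Dataset attribute) shows them. A minimal sketch:
# Shapes after map / batch; the leading None is the batch dimension.
print(Train_Dataset.element_spec)
# Expected to show roughly:
# (TensorSpec(shape=(None, 7, 7, 1280), dtype=tf.float32, name=None),
#  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))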








3. Model Define

  • Now that the data is ready, let's define a simple dense network to classify the features extracted by EfficientNet.

  • Pay attention to the input shape and build the layers as you see fit.

model = tf.keras.Sequential()

model.add( tf.keras.layers.InputLayer(input_shape=(7,7,1280)) )
model.add( tf.keras.layers.GlobalAveragePooling2D() )

model.add( tf.keras.layers.BatchNormalization() )
model.add( tf.keras.layers.Dropout(0.25) )
model.add( tf.keras.layers.Dense(256 , activation='relu') )

model.add( tf.keras.layers.BatchNormalization() )
model.add( tf.keras.layers.Dropout(0.25) )
model.add( tf.keras.layers.Dense(64 , activation='relu') )

model.add( tf.keras.layers.BatchNormalization() )
model.add( tf.keras.layers.Dropout(0.25) )
model.add( tf.keras.layers.Dense(2 , activation='softmax') )
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
global_average_pooling2d (Gl (None, 1280)              0         
_________________________________________________________________
batch_normalization (BatchNo (None, 1280)              5120      
_________________________________________________________________
dropout (Dropout)            (None, 1280)              0         
_________________________________________________________________
dense (Dense)                (None, 256)               327936    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64)                256       
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 130       
=================================================================
Total params: 350,914
Trainable params: 347,714
Non-trainable params: 3,200
_________________________________________________________________




  • Choose and configure an optimizer, compile the model, and start training.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.compile(optimizer=optimizer, 
              loss="categorical_crossentropy", metrics=["accuracy"]
             )
EPOCHS = 5

hist = model.fit(Train_Dataset, 
                 epochs=EPOCHS, 
                 validation_data=Val_Dataset, 
                 verbose=1)
Epoch 1/5
621/621 [==============================] - 285s 458ms/step - loss: 0.0579 - accuracy: 0.9812 - val_loss: 0.0319 - val_accuracy: 0.9887
Epoch 2/5
621/621 [==============================] - 362s 582ms/step - loss: 0.0383 - accuracy: 0.9867 - val_loss: 0.0371 - val_accuracy: 0.9881
Epoch 3/5
621/621 [==============================] - 384s 618ms/step - loss: 0.0339 - accuracy: 0.9892 - val_loss: 0.0327 - val_accuracy: 0.9895
Epoch 4/5
621/621 [==============================] - 473s 761ms/step - loss: 0.0355 - accuracy: 0.9880 - val_loss: 0.0291 - val_accuracy: 0.9899
Epoch 5/5
621/621 [==============================] - 436s 702ms/step - loss: 0.0331 - accuracy: 0.9902 - val_loss: 0.0357 - val_accuracy: 0.9895




  • The model already shows high accuracy on both the Train and Val. sets from the very first epoch.

  • It looks like everything is working properly.




import matplotlib.pyplot as plt
plt.plot(hist.history["accuracy"])
plt.plot(hist.history["val_accuracy"])
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()








4. Summary

  • In this post, we went over how to read the TFRecord file format and use it for actual training.

  • The important point is that in order to use a TFRecord file, you must know the structure (the features) that was used when the dataset was created.