Keyword analysis with Keras (2/5)

At first I didn't know how to use the Embedding layer that Keras provides. The official Keras documentation explains it precisely:

https://keras.io/layers/embeddings/#embedding

You pass the embedding_matrix as the layer's weights and feed word indices as the input, and the layer converts each index into its vector. So the work has to proceed in the following order (a minimal sketch follows the list):

  1. Load the word2vec file built earlier.
  2. Use the whole vocabulary as the embedding_matrix.
  3. Split each tag into tokens with konlpy.
  4. Look up each word's index in the loaded word2vec model and turn the sentence into a sequence of indices.
  5. Pad the sequences to a fixed length.
  6. Feed the result to the network as input.
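
To make the mechanism concrete, here is a minimal sketch with toy numbers (the 10x4 matrix and the indices are made up, not the real vocabulary): indices go in, and the matching rows of the frozen weight matrix come out.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

embedding_matrix = np.random.rand(10, 4)   #10 fake words, 4-dim vectors
m = Sequential()
m.add(Embedding(input_dim=10, output_dim=4, input_length=3, weights=[embedding_matrix], trainable=False))
out = m.predict(np.array([[1, 2, 3]]))
print(out.shape)                                    #(1, 3, 4): each index became its vector
print(np.allclose(out[0, 0], embedding_matrix[1]))  #True: index 1 maps to row 1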

Now that I have a rough idea of how to do it, time to get back to the trial and error.

I rewrote the code as below. The first goal was just to confirm whether the network can be trained with fit at all. It can: every input and output is read correctly, the model is set up as a binary classifier, and the embedding also seems to work (a probe after the listing double-checks this).

from konlpy.tag import Okt
okt=Okt()
from gensim.models import Word2Vec
from keras.layers import Dense, Flatten
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
import numpy as np

#Load the word2vec model trained earlier.
model=Word2Vec.load('./myModel')
#b=model.wv.most_similar(positive=["클램프", "잠김"])
print(model)

#print(model.wv["후드힌지시프트"])

#Tokenizer setup: use Okt().

targetFile = open("./tagv4태그붙인파일.csv", "r", encoding='UTF-8')

i=0
sentence_by_index=[]
training_result=[]
WORD_MAX=6      #maximum words per sentence
WV_SIZE=4       #dimension of the word2vec vectors
MAX_VOCAB=3532  #vocabulary size of the word2vec model

#Placeholder for words that are not found in the vocabulary.
VACANT_ARRAY = np.zeros((4,1))


while True:

    lines = targetFile.readline()
    if not lines:break
    firstColumn = lines.split(',')
    #print(lines)

    #if i == 1000:break
    i=i+1
    #Use the same morphological analyzer that built the word2vec model.
    tokenlist = okt.pos(firstColumn[1], stem=True, norm=True)
    temp=[]

    for word in tokenlist:
        #word[0] is the token.
        #word[1] is its part-of-speech tag.
        #print("word[0] is", word[0])
        #print("word[1] is", word[1])

        if word[1] in ["Noun","Alpha","Number"]:
            #temp.append(model.wv[word[0]])
            #Convert word[0] into its vocabulary index.
            #Words missing from the vocabulary raise and are handled below.
            #To keep inputs and outputs aligned, append a value on every pass.
            try:
                #print("---------")
                #print(i)
                #print(word[0])
                temp.append(model.wv.vocab.get(word[0]).index)
                #print(model.wv.vocab.get(word[0]).index)

            except AttributeError:
                #If the word is not found, insert index 0.
                temp.append(0)
                #print(temp)
    #print("index is ", i)
    #print("temp is", temp)

    #The borrowed code appended only when temp was non-empty.
    #Changed to append even when the list is empty, so inputs stay aligned with the labels.
    #if temp:
    #    sentence_by_index.append(temp)
    sentence_by_index.append(temp)

    #Store the label as well.
    tempResult=firstColumn[4].strip('\n')
    training_result.append(tempResult)

targetFile.close()
training_result_asarray = np.asarray(training_result)

#Cap each sentence at WORD_MAX (6) words.
#Shorter rows are zero-padded at the back up to length 6.
#The values are word indices, so the dtype is int.
fixed_sentence_by_index = pad_sequences(sentence_by_index, maxlen=WORD_MAX, padding='post', dtype='int')
print("입력 시퀀스는", fixed_sentence_by_index.shape)
print("출력 시퀀스는", training_result_asarray.shape)
#print("index로 변경한 값은",fixed_sentence_by_index)
#print("embedding vector는", model.wv.vectors)

#Keras model setup: frozen word2vec embedding, flattened into a dense classifier.
model2 = Sequential()
model2.add(Embedding(input_dim=MAX_VOCAB, output_dim=WV_SIZE, input_length=WORD_MAX, weights=[model.wv.vectors], trainable=False))
model2.add(Flatten())
model2.add(Dense(1, activation='softmax'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.fit(x=fixed_sentence_by_index, y=training_result_asarray, epochs=1000, verbose=1, validation_split=0.2)

model2.summary()
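
To back up the claim above that the embedding is wired in correctly, here is a quick probe I added (a sanity check, not part of the original run): feed one known index through the Embedding layer and compare the output against the word2vec matrix.

#Hedged check: does the frozen Embedding reproduce the word2vec vectors?
from keras.models import Model

probe = np.zeros((1, WORD_MAX), dtype='int')
probe[0, 0] = 5   #any valid index below MAX_VOCAB
emb_out = Model(inputs=model2.input, outputs=model2.layers[0].output).predict(probe)
print(np.allclose(emb_out[0, 0], model.wv.vectors[5]))   #True if the weights loaded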

I made the network slightly more complex and trained it for 1000 epochs on a GTX 1060 GPU.

root@AMD-1804:/home/mnt/konlpy_190910_moved# python tagCheckerv3.py 
/usr/local/lib/python3.5/dist-packages/jpype/_core.py:210: UserWarning: 
-------------------------------------------------------------------------------
Deprecated: convertStrings was not specified when starting the JVM. The default
behavior in JPype will be False starting in JPype 0.8. The recommended setting
for new code is convertStrings=False.  The legacy value of True was assumed for
this session. If you are a user of an application that reported this warning,
please file a ticket with the developer.
-------------------------------------------------------------------------------

  """)
Using TensorFlow backend.
Word2Vec(vocab=3532, size=4, alpha=0.025)
2019-09-11 11:02:33.941682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.38GiB
2019-09-11 11:02:33.941740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-09-11 11:02:34.356140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-11 11:02:34.356206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-09-11 11:02:34.356217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-09-11 11:02:34.356473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5138 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Train on 77696 samples, validate on 19424 samples
Epoch 1/1000
77696/77696 [==============================] - 19s 248us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
Epoch 2/1000
77696/77696 [==============================] - 18s 238us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
Epoch 3/1000
77696/77696 [==============================] - 19s 240us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
(epochs 4 through 31 print exactly the same loss/acc numbers; omitted)
Epoch 32/1000
77696/77696 [==============================] - 19s 241us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325

There is no sign that acc or loss will budge. Next time I'll make the network a bit more complex and train it again.
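
Before that, though, one thing worth ruling out (my hedged suspicion, not something I verified in this run): softmax over a single output unit always returns 1.0, since softmax normalizes across the units, and that alone would freeze the loss and accuracy exactly like the log above. A minimal sketch of the usual binary head, reusing the objects defined earlier and casting the string labels explicitly (assuming they are '0'/'1'):

#Hedged sketch: same network, but with a sigmoid head.
#softmax over one unit is constant 1.0; sigmoid gives a real probability in (0, 1).
model3 = Sequential()
model3.add(Embedding(input_dim=MAX_VOCAB, output_dim=WV_SIZE, input_length=WORD_MAX, weights=[model.wv.vectors], trainable=False))
model3.add(Flatten())
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.fit(x=fixed_sentence_by_index, y=training_result_asarray.astype('float32'), epochs=10, verbose=1, validation_split=0.2)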
