I didn't know how to use the embedding support in keras. The documentation keras provides explains the embedding feature precisely.
If you pass the embedding_matrix as the weights and word indices as the input, the layer converts each index into its vector. So the work has to be done in the following order (a small sketch of the same pipeline follows the list).
- Load the previously built word2vec file.
- Use the vectors for the entire vocabulary as the embedding_matrix.
- Split each tag into tokens with konlpy (Okt).
- Look up each word's index in the loaded word2vec model and build the sentence out of those indices.
- Pad to an appropriate length.
- Feed this in as the input.
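Here is a minimal sketch of that order, assuming the same old gensim 3.x / Keras 2.x APIs that the full script below uses; the model path and the two sample words are only examples.

```python
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.preprocessing.sequence import pad_sequences

w2v = Word2Vec.load('./myModel')              # pre-trained word2vec model
vocab_size, vec_size = w2v.wv.vectors.shape   # (number of words, vector size)

def to_index(word):
    # word -> vocabulary index; index 0 doubles as the out-of-vocabulary slot here
    entry = w2v.wv.vocab.get(word)
    return entry.index if entry is not None else 0

# In the real script the words come from okt.pos(); two known words are enough here.
sentence = [to_index(w) for w in ["클램프", "잠김"]]
padded = pad_sequences([sentence], maxlen=6, padding='post')   # fixed length of 6

# Embedding layer initialized with the word2vec matrix and kept frozen.
emb = Sequential()
emb.add(Embedding(input_dim=vocab_size, output_dim=vec_size, input_length=6,
                  weights=[w2v.wv.vectors], trainable=False))
print(emb.predict(padded).shape)   # (1, 6, vec_size): every index was replaced by its vector
```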
Now that I have a rough idea of how to do it, time to get back to trial and error.
I rewrote the script as shown below. First I checked whether the network can be trained with fit at all. It can. All inputs and outputs are read correctly, and I set it up as a binary classifier. The embedding also seems to work properly.
from konlpy.tag import Okt
from gensim.models import Word2Vec
from keras.layers import Dense, Flatten
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
import numpy as np

okt = Okt()

model = Word2Vec.load('./myModel')
#b = model.wv.most_similar(positive=["클램프", "잠김"])
print(model)
#print(model.wv["후드힌지시프트"])

# Tokenizer setup: use Okt().
targetFile = open("./tagv4태그붙인파일.csv", "r", encoding='UTF-8')
i = 0
sentence_by_index = []
traing_result = []
WORD_MAX = 6
WV_SIZE = 4
MAX_VOCAB = 3532

# Placeholder for words that cannot be found in the vocabulary.
VACANT_ARRAY = np.zeros((4, 1))

while True:
    lines = targetFile.readline()
    firstColumn = lines.split(',')
    #print(lines)
    if not lines: break
    #if i == 1000: break
    i = i + 1
    # Use the same morphological analyzer that the word2vec model was built with.
    tokenlist = okt.pos(firstColumn[1], stem=True, norm=True)
    temp = []
    for word in tokenlist:
        # word[0] is the token, word[1] is its part of speech.
        #print("word[0] is", word[0])
        #print("word[1] is", word[1])
        if word[1] in ["Noun", "Alpha", "Number"]:
            #temp.append(model.wv[word[0]])
            # Convert word[0] to its vocabulary index.
            # Words missing from the vocabulary are handled in the except branch,
            # so inputs and outputs can always be appended together and stay aligned.
            try:
                #print("---------")
                #print(i)
                #print(word[0])
                temp.append(model.wv.vocab.get(word[0]).index)
                #print(model.wv.vocab.get(word[0]).index)
            except AttributeError:
                # If the word is not found, store index 0 instead.
                temp.append(0)
    #print(temp)
    #print("index is ", i)
    #print("temp is", temp)
    # The code this was borrowed from appended only when temp was non-empty;
    # changed to append even when the list is empty so inputs stay matched with outputs.
    #if temp:
    #    sentence_by_index.append(temp)
    sentence_by_index.append(temp)
    # Store the label column as the training target.
    tempResult = firstColumn[4].strip('\n')
    traing_result.append(tempResult)

targetFile.close()
training_result_asarray = np.asarray(traing_result)

# Limit each sentence to 6 words; shorter sentences are padded with 0 at the end.
fixed_sentence_by_index = pad_sequences(sentence_by_index, maxlen=WORD_MAX, padding='post', dtype='int')

print("input sequence shape:", fixed_sentence_by_index.shape)
print("output sequence shape:", training_result_asarray.shape)
#print("index-converted values:", fixed_sentence_by_index)
#print("embedding vectors:", model.wv.vectors)

# Keras model: a frozen Embedding initialized from the word2vec matrix, then a binary classifier.
model2 = Sequential()
model2.add(Embedding(input_dim=MAX_VOCAB, output_dim=4, input_length=WORD_MAX,
                     weights=[model.wv.vectors], trainable=False))
model2.add(Flatten())
model2.add(Dense(1, activation='softmax'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.fit(x=fixed_sentence_by_index, y=training_result_asarray,
           epochs=1000, verbose=1, validation_split=0.2)
model2.summary()
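As a side note, one way to convince yourself that the frozen embedding really returns the word2vec rows is to cut the model off right after the Embedding layer. This is only a sketch, reusing model, model2, and fixed_sentence_by_index from the script above.

```python
import numpy as np
from keras.models import Model

# Sub-model that stops at the Embedding layer.
emb_only = Model(inputs=model2.input, outputs=model2.layers[0].output)
emb_out = emb_only.predict(fixed_sentence_by_index[:1])   # shape (1, WORD_MAX, 4)

# Each position should be exactly the word2vec row for that index.
first_index = fixed_sentence_by_index[0][0]
print(np.allclose(emb_out[0][0], model.wv.vectors[first_index]))   # expected: True
```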
I made the network a little more complex and trained it for 1000 epochs on a GTX 1060 GPU.
root@AMD-1804:/home/mnt/konlpy_190910_moved# python tagCheckerv3.py
/usr/local/lib/python3.5/dist-packages/jpype/_core.py:210: UserWarning:
-------------------------------------------------------------------------------
Deprecated: convertStrings was not specified when starting the JVM. The default
behavior in JPype will be False starting in JPype 0.8. The recommended setting
for new code is convertStrings=False. The legacy value of True was assumed for
this session. If you are a user of an application that reported this warning,
please file a ticket with the developer.
-------------------------------------------------------------------------------
  """)
Using TensorFlow backend.
Word2Vec(vocab=3532, size=4, alpha=0.025)
2019-09-11 11:02:33.941682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.38GiB
2019-09-11 11:02:33.941740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-09-11 11:02:34.356140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-11 11:02:34.356206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-09-11 11:02:34.356217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-09-11 11:02:34.356473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5138 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Train on 77696 samples, validate on 19424 samples
Epoch 1/1000
77696/77696 [==============================] - 19s 248us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
Epoch 2/1000
77696/77696 [==============================] - 18s 238us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
Epoch 3/1000
77696/77696 [==============================] - 19s 240us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
Epoch 4/1000
77696/77696 [==============================] - 19s 239us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
Epoch 5/1000
77696/77696 [==============================] - 19s 241us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
... (epochs 6-31: identical loss/acc values) ...
Epoch 32/1000
77696/77696 [==============================] - 19s 241us/step - loss: 11.2624 - acc: 0.2936 - val_loss: 12.2350 - val_acc: 0.2325
There is no sign of acc or loss changing. Next time I should try making the network a bit more complex and train it again.
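For that next attempt, something like the following could be a starting point. This is only a sketch, not a tested model: the layer sizes are arbitrary guesses, it assumes a sigmoid output for the 0/1 label, and it reuses model, MAX_VOCAB, and WORD_MAX from the script above.

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.layers.embeddings import Embedding

# Slightly deeper classifier on top of the same frozen word2vec embedding.
model3 = Sequential()
model3.add(Embedding(input_dim=MAX_VOCAB, output_dim=4, input_length=WORD_MAX,
                     weights=[model.wv.vectors], trainable=False))
model3.add(Flatten())
model3.add(Dense(32, activation='relu'))    # hidden layer size is an arbitrary choice
model3.add(Dropout(0.5))
model3.add(Dense(1, activation='sigmoid'))  # sigmoid, since the label is 0/1
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.summary()
```

It could then be trained with fit on the same fixed_sentence_by_index input as before.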