{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
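    "# Advanced Gensim Tutorial: Training Doc2vec\n",
    "Doc2vec is another tool proposed by Mikolov, built on top of word2vec, for computing vectors for long texts. It works much like word2vec; the only difference is that each long text is added to the training corpus as a special token id.\n",
    "\n",
    "In Gensim, doc2vec is also implemented as a subclass of word2vec. As a result, doc2vec closely mirrors word2vec both in its API parameters and in how text vectors are retrieved.\n",
    "\n",
    "The main difference lies in how the input data is preprocessed. Doc2vec takes an iterator of TaggedDocument objects (called LabeledSentence in older Gensim releases) as the input to its constructor. TaggedDocument is a class built into Gensim that is initialized with two lists: a word list and a tag (label) list.\n"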
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "# Toy corpus: one short document per string\n",
    "data = [\"I love machine learning. It's awesome.\",\n",
    "        \"I love coding in python\",\n",
    "        \"I love building chatbots\",\n",
    "        \"they chat amazingly well\"]\n",
    "\n",
    "# Lowercase and tokenize each document, then tag it as SENT_0, SENT_1, ...\n",
    "# (word_tokenize requires NLTK's tokenizer data to be downloaded beforehand)\n",
    "tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=['SENT_%d' % i])\n",
    "               for i, doc in enumerate(data)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
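    "model = Doc2Vec(vector_size=100,  # dimensionality of the document vectors\n",
    "                alpha=0.01,       # initial learning rate\n",
    "                min_count=2,      # ignore words seen fewer than 2 times\n",
    "                dm=1)             # 1 = distributed memory (PV-DM)\n",
    "model.build_vocab(tagged_data)\n",
    "\n",
    "# Each train() call runs model.epochs passes, so this loop makes 10 * model.epochs passes in total\n",
    "for epoch in range(10):\n",
    "    print('iteration {0}'.format(epoch))\n",
    "    model.train(tagged_data,\n",
    "                total_examples=model.corpus_count,\n",
    "                epochs=model.epochs)"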
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model[\"SENT_0\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.docvecs.most_similar(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}