This commit is contained in:
chongjiu.jin
2019-10-21 18:05:16 +08:00
commit 75b33e19fa
12 changed files with 4535 additions and 0 deletions


@@ -0,0 +1,97 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gensim进阶教程训练Doc2vec\n",
"Doc2vec是Mikolov在word2vec基础上提出的另一个用于计算长文本向量的工具。它的工作原理与word2vec极为相似——只是将长文本作为一个特殊的token id引入训练语料中。\n",
"\n",
"在Gensim中doc2vec也是继承于word2vec的一个子类。因此无论是API的参数接口还是调用文本向量的方式doc2vec与word2vec都极为相似。\n",
"\n",
"主要的区别是在对输入数据的预处理上。Doc2vec接受一个由LabeledSentence对象组成的迭代器作为其构造函数的输入参数。其中LabeledSentence是Gensim内建的一个类它接受两个List作为其初始化的参数word list和label list。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n",
"from nltk.tokenize import word_tokenize\n",
"data = [\"I love machine learning. Its awesome.\",\n",
" \"I love coding in python\",\n",
" \"I love building chatbots\",\n",
" \"they chat amagingly well\"]\n",
"tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=['SENT_%s' %str(i)]) for i, _d in enumerate(data)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"model = Doc2Vec(size=100,\n",
" alpha=0.01, \n",
" min_count=2,\n",
" dm =1)\n",
"model.build_vocab(tagged_data)\n",
"for epoch in range(10):\n",
" print('iteration {0}'.format(epoch))\n",
" model.train(tagged_data,\n",
" total_examples=model.corpus_count,\n",
" epochs=model.iter)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model[\"SENT_0\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.docvecs.most_similar(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
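
The Doc2vec notebook above describes the input format as a word list paired with a tag list. The sketch below shows only that shape, without training anything. `TaggedDoc` is a hypothetical stand-in defined here so the example runs without Gensim installed, and plain `str.split` replaces `word_tokenize` for brevity; neither is part of the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields the notebook passes to
# TaggedDocument (words, tags); defined here so the sketch needs no gensim.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

data = ["I love coding in python", "I love building chatbots"]

# Assumption: whitespace splitting instead of nltk's word_tokenize.
tagged = [
    TaggedDoc(words=text.lower().split(), tags=["SENT_%d" % i])
    for i, text in enumerate(data)
]

for doc in tagged:
    print(doc.tags, doc.words)
```

Each document carries its own tag, which is what lets the model learn one vector per document alongside the word vectors.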

File diff suppressed because one or more lines are too long (2 files)

Binary file not shown (7 image files; sizes: 16 KiB, 10 KiB, 33 KiB, 3.1 KiB, 6.9 KiB, 12 KiB, 11 KiB)


@@ -0,0 +1,963 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Numpy Review\n",
"\n",
"主要复习numpy\n",
"\n",
"tutor: `chongjiujin # gmail.com`\n",
"\n",
"```\n",
"if you have any question in python or pytorch:\n",
"\n",
" print(add personal weichat:flypython)\n",
" ```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# List Slicing\n",
"\n",
"List elements can be accessed in convenient ways.\n",
"\n",
"Basic format: some_list[start_index:end_index]"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers = [0, 1, 2, 3, 4, 5, 6]\n",
"numbers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers[0:3]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers[:4]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[5, 6]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers[5:]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers[:]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Negative index wraps around\n",
"numbers[-1]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[4, 5, 6]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers[-3:]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Can mix and match\n",
"numbers[1:-10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Numpy python矩阵计算库\n",
"\n",
"\n",
"Optimized library for matrix and vector computation.\n",
"\n",
"用于矩阵和向量\n",
"\n",
"\n",
"\n",
"Makes use of C/C++ subroutines and memory-efficient data structures.\n",
"\n",
"底层是C/C++编译的,效率更高\n",
"\n",
"(Lots of computation can be efficiently represented as vectors.)\n",
"\n",
"**Main data type: `np.ndarray`**\n",
"\n",
"This is the data type that you will use to represent matrix/vector computations.\n",
"这个数据结构是用来放矩阵/向量的\n",
"\n",
"Note: constructor function is `np.array()`\n",
"\n",
" `np.array()`初始化函数\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np#导入库"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3,)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.array([1,2,3])#一维向量\n",
"x\n",
"x.shape"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2, 3)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = np.array([[3,4,5],[6,7,8]])#二维矩阵\n",
"y.shape"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 1)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = np.array([[1],[2],[3]])#每个框是增加一个维度\n",
"y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# np.ndarray Operations 操作函数\n",
"\n",
"Reductions: `np.max`, `np.min`, `np.argmax`, `np.sum`, `np.mean`, …\n",
"\n",
"Always reduces along an axis! (Or will reduce along all axes if not specified.)\n",
"\n",
"(You can think of this as “collapsing” this axis into the functions output.)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.array([1,2,3])#一维向量\n",
"x.max()#np.max(x)\n",
"#x.min()\n",
"#x.sum()\n",
"#x.mean()\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[5],\n",
" [8]])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = np.array([[3,4,5],[6,7,8]])#按维度取最大值\n",
"#np.max(y,axis = 1)\n",
"np.max(y, axis = 1, keepdims = True)\n",
"#https://docs.scipy.org/doc/numpy/reference/generated/numpy.amax.html#numpy.amax"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 基本矩阵运算\n",
"\n",
"\n",
"`np.dot`矩阵点乘\n",
"$$ np.dot(v,w)=v^T w $$\n",
"https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html?highlight=dot#numpy.dot\n",
"\n",
"`np.multiply` 在 np.array 中重载为元素乘法,在 np.matrix 中重载为矩阵乘法\n",
"\n",
"https://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html\n",
"\n",
"\n",
"我们这里只讨论一维向量"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#np.dot点乘\n",
"\n",
"x=np.array([1,2,3])#一维向量\n",
"y=np.array([1,2,3])#一维向量\n",
"np.dot(x,y)\n",
"#"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"14"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(x.T*y)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 4, 9])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x=np.array([1,2,3])#一维向量\n",
"np.multiply(x,x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Indexing 索引"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#基本同list\n",
"x = np.array([1,2,3])#一维向量\n",
"x[x > 2]\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 2, 1])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"index=[2,1,0]#按索引排序\n",
"x[index]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 矩阵遍历\n",
"\n",
"有时候需要遍历矩阵里所有的向量"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[3 4 5]\n",
" [6 7 8]]\n"
]
}
],
"source": [
"y = np.array([[3,4,5],[6,7,8]])#二维矩阵\n",
"print(y)\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[3 4 5]\n",
"-----\n",
"[6 7 8]\n",
"-----\n"
]
}
],
"source": [
"#默认按第1维度遍历\n",
"for y1 in y:\n",
" print(y1)\n",
" print(\"-----\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2 3\n"
]
}
],
"source": [
"#按指定维度遍历\n",
"d1,d2= y.shape\n",
"print(d1,d2)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 [3 6]\n",
"1 [4 7]\n",
"2 [5 8]\n"
]
}
],
"source": [
"for d in range(d2):\n",
" print(d,y[:,d])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Efficient Numpy Code\n",
"尽量用Numpy的特性提升效率"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"x = np.array([[3,4,5],[6,7,8]])#二维矩阵\n",
"y = np.array([[1,2,3],[9,0,10]])#二维矩阵"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 9, 16, 25],\n",
" [36, 49, 64]])"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for i in range(x.shape[0]):\n",
" for j in range(x.shape[1]):\n",
" x[i,j] **= 2\n",
"x"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 81, 256, 625],\n",
" [1296, 2401, 4096]])"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x **= 2\n",
"x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 全0 和全 1 矩阵"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.30950800e+06, 1.82888704e+08])"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z=np.zeros((2,))\n",
"for i in range(x.shape[0]):\n",
" x1=x[i]\n",
" y1=y[i]\n",
" z[i]=np.dot(x1,y1)\n",
"z"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 1., 1.],\n",
" [1., 1., 1.]])"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z=np.ones((2,3))\n",
"z"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 矩阵和常数计算以及 Broadcasting广播"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 3)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.array([[3,4,5],[6,7,8],[1,2,3]])#二维矩阵\n",
"x.shape"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 5, 6, 7],\n",
" [ 8, 9, 10],\n",
" [ 3, 4, 5]])"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x+2"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 6, 8, 10],\n",
" [12, 14, 16],\n",
" [ 2, 4, 6]])"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x*2"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 1)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y=np.array([[2],[4],[8]])\n",
"y.shape"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 5, 6, 7],\n",
" [10, 11, 12],\n",
" [ 9, 10, 11]])"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x+y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 矩阵变换"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1, 3)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z=np.array([[2, 4, 8]])\n",
"z.shape"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 1)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z=y.reshape(-1,1)\n",
"z.shape"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1, 3)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z=y.T\n",
"z.shape"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 6, 16, 40],\n",
" [12, 28, 64],\n",
" [ 2, 8, 24]])"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x*z"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 思考题\n",
"y=np.array([[2],[4],[8]])\n",
"\n",
"(y + y.T)是什么\n",
"\n",
"\n",
"# 如果对操作有不确定开一个jupyter notebook测试后使用"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
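
The broadcasting cells in the NumPy review notebook above add a `(3, 3)` matrix to a `(3, 1)` column. As a small illustration of the same rule, the sketch below adds a column to a row and checks the result against an explicit double loop. The values of `col` and `row` are hypothetical, chosen only to make the pattern easy to read; they do not come from the notebook.

```python
import numpy as np

# Hypothetical example values, picked so each output entry shows its origin.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([[1, 2, 3]])         # shape (1, 3)

# Size-1 axes are stretched to match the other operand:
# (3, 1) + (1, 3) -> (3, 3)
total = col + row
print(total.shape)
print(total)

# The same result written as an explicit double loop, for comparison.
expected = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        expected[i, j] = col[i, 0] + row[0, j]

print(np.array_equal(total, expected))
```

Entry `[i, j]` of the result is `col[i, 0] + row[0, j]`, so no axis needs to be copied or tiled by hand.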