commit 75b33e19fad3e0e074af0bb8e4297f774cc79fb6 Author: chongjiu.jin Date: Mon Oct 21 18:05:16 2019 +0800 a1 diff --git a/Assignment_1_intro_word_vectors/Gensim Doc2vec.ipynb b/Assignment_1_intro_word_vectors/Gensim Doc2vec.ipynb new file mode 100644 index 0000000..34ced46 --- /dev/null +++ b/Assignment_1_intro_word_vectors/Gensim Doc2vec.ipynb @@ -0,0 +1,97 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Gensim进阶教程:训练Doc2vec\n", + "Doc2vec是Mikolov在word2vec基础上提出的另一个用于计算长文本向量的工具。它的工作原理与word2vec极为相似——只是将长文本作为一个特殊的token id引入训练语料中。\n", + "\n", + "在Gensim中,doc2vec也是继承于word2vec的一个子类。因此,无论是API的参数接口还是调用文本向量的方式,doc2vec与word2vec都极为相似。\n", + "\n", + "主要的区别是在对输入数据的预处理上。Doc2vec接受一个由LabeledSentence对象组成的迭代器作为其构造函数的输入参数。其中,LabeledSentence是Gensim内建的一个类,它接受两个List作为其初始化的参数:word list和label list。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n", + "from nltk.tokenize import word_tokenize\n", + "data = [\"I love machine learning. Its awesome.\",\n", + " \"I love coding in python\",\n", + " \"I love building chatbots\",\n", + " \"they chat amagingly well\"]\n", + "tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=['SENT_%s' %str(i)]) for i, _d in enumerate(data)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "model = Doc2Vec(size=100,\n", + " alpha=0.01, \n", + " min_count=2,\n", + " dm =1)\n", + "model.build_vocab(tagged_data)\n", + "for epoch in range(10):\n", + " print('iteration {0}'.format(epoch))\n", + " model.train(tagged_data,\n", + " total_examples=model.corpus_count,\n", + " epochs=model.iter)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model[\"SENT_0\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model.docvecs.most_similar(0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/Assignment_1_intro_word_vectors/Gensim word vector visualization1.ipynb b/Assignment_1_intro_word_vectors/Gensim word vector visualization1.ipynb new file mode 100644 index 0000000..af9fba0 --- /dev/null +++ b/Assignment_1_intro_word_vectors/Gensim word vector visualization1.ipynb @@ -0,0 +1,1987 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Gensim word vector visualization of various word vectors" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "# Get the interactive Tools for Matplotlib\n", + "%matplotlib notebook\n", + "import matplotlib.pyplot as plt\n", + "plt.style.use('ggplot')\n", + "\n", + "from sklearn.decomposition import PCA\n", + "\n", + "from gensim.test.utils import datapath, get_tmpfile\n", + "from gensim.models import KeyedVectors\n", + "from gensim.scripts.glove2word2vec 
import glove2word2vec" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For looking at word vectors, I'll use Gensim. We also use it in hw1 for word vectors. Gensim isn't really a deep learning package. It's a package for for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But its efficient and scalable, and quite widely used." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our homegrown Stanford offering is GloVe word vectors. Gensim doesn't give them first class support, but allows you to convert a file of GloVe vectors into word2vec format. You can download the GloVe vectors from [the Glove page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(I use the 100d vectors below as a mix between speed and smallness vs. quality. If you try out the 50d vectors, they basically work for similarity but clearly aren't as good for analogy problems. If you load the 300d vectors, they're even better than the 100d vectors.)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(400000, 100)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "glove_file = datapath('D:\\\\project\\\\ml\\\\github\\\\cs224n-natural-language-processing-winter2019\\\\a1_intro_word_vectors\\\\a1\\\\glove.6B.100d.txt')\n", + "word2vec_glove_file = get_tmpfile(\"glove.6B.100d.word2vec.txt\")\n", + "glove2word2vec(glove_file, word2vec_glove_file)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "model = KeyedVectors.load_word2vec_format(word2vec_glove_file)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('barack', 0.937216579914093),\n", + " ('bush', 0.9272854328155518),\n", + " ('clinton', 0.8960003852844238),\n", + " ('mccain', 0.8875634074211121),\n", + " ('gore', 0.8000321388244629),\n", + " ('hillary', 0.7933663129806519),\n", + " ('dole', 0.7851964235305786),\n", + " ('rodham', 0.7518897652626038),\n", + " ('romney', 0.7488930225372314),\n", + " ('kerry', 0.7472623586654663)]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.most_similar('obama')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('coconut', 0.7097253799438477),\n", + " ('mango', 0.7054824233055115),\n", + " ('bananas', 0.6887733340263367),\n", + " ('potato', 0.6629636287689209),\n", + " ('pineapple', 0.6534532308578491),\n", + " ('fruit', 0.6519854664802551),\n", + " ('peanut', 0.6420576572418213),\n", + " ('pecan', 0.6349173188209534),\n", + " ('cashew', 0.629442036151886),\n", + " ('papaya', 0.6246591210365295)]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.most_similar('banana')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model.most_similar(negative='banana')" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + 
"text": [ + "queen: 0.7699\n" + ] + } + ], + "source": [ + "result = model.most_similar(positive=['woman', 'king'], negative=['man'])\n", + "print(\"{}: {:.4f}\".format(*result[0]))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "def analogy(x1, x2, y1):\n", + " result = model.most_similar(positive=[y1, x2], negative=[x1])\n", + " return result[0][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Analogy](imgs/word2vec-king-queen-composition.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'australian'" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "analogy('japan', 'japanese', 'australia')" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'champagne'" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "analogy('australia', 'beer', 'france')" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'nixon'" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "analogy('obama', 'clinton', 'reagan')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'longest'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "analogy('tall', 'tallest', 'long')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'terrible'" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "analogy('good', 'fantastic', 'bad')" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cereal\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\gensim\\models\\keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. 
Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n", + " vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n" + ] + } + ], + "source": [ + "print(model.doesnt_match(\"breakfast cereal dinner lunch\".split()))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "def display_pca_scatterplot(model, words=None, sample=0):\n", + " if words == None:\n", + " if sample > 0:\n", + " words = np.random.choice(list(model.vocab.keys()), sample)\n", + " else:\n", + " words = [ word for word in model.vocab ]\n", + " \n", + " word_vectors = np.array([model[w] for w in words])\n", + "\n", + " twodim = PCA().fit_transform(word_vectors)[:,:2]\n", + " \n", + " plt.figure(figsize=(6,6))\n", + " plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')\n", + " for word, (x,y) in zip(words, twodim):\n", + " plt.text(x+0.05, y+0.05, word)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "application/javascript": [ + "/* Put everything inside the global mpl namespace */\n", + "window.mpl = {};\n", + "\n", + "\n", + "mpl.get_websocket_type = function() {\n", + " if (typeof(WebSocket) !== 'undefined') {\n", + " return WebSocket;\n", + " } else if (typeof(MozWebSocket) !== 'undefined') {\n", + " return MozWebSocket;\n", + " } else {\n", + " alert('Your browser does not have WebSocket support.' +\n", + " 'Please try Chrome, Safari or Firefox ≥ 6. ' +\n", + " 'Firefox 4 and 5 are also supported but you ' +\n", + " 'have to enable WebSockets in about:config.');\n", + " };\n", + "}\n", + "\n", + "mpl.figure = function(figure_id, websocket, ondownload, parent_element) {\n", + " this.id = figure_id;\n", + "\n", + " this.ws = websocket;\n", + "\n", + " this.supports_binary = (this.ws.binaryType != undefined);\n", + "\n", + " if (!this.supports_binary) {\n", + " var warnings = document.getElementById(\"mpl-warnings\");\n", + " if (warnings) {\n", + " warnings.style.display = 'block';\n", + " warnings.textContent = (\n", + " \"This browser does not support binary websocket messages. \" +\n", + " \"Performance may be slow.\");\n", + " }\n", + " }\n", + "\n", + " this.imageObj = new Image();\n", + "\n", + " this.context = undefined;\n", + " this.message = undefined;\n", + " this.canvas = undefined;\n", + " this.rubberband_canvas = undefined;\n", + " this.rubberband_context = undefined;\n", + " this.format_dropdown = undefined;\n", + "\n", + " this.image_mode = 'full';\n", + "\n", + " this.root = $('
');\n", + " this._root_extra_style(this.root)\n", + " this.root.attr('style', 'display: inline-block');\n", + "\n", + " $(parent_element).append(this.root);\n", + "\n", + " this._init_header(this);\n", + " this._init_canvas(this);\n", + " this._init_toolbar(this);\n", + "\n", + " var fig = this;\n", + "\n", + " this.waiting = false;\n", + "\n", + " this.ws.onopen = function () {\n", + " fig.send_message(\"supports_binary\", {value: fig.supports_binary});\n", + " fig.send_message(\"send_image_mode\", {});\n", + " if (mpl.ratio != 1) {\n", + " fig.send_message(\"set_dpi_ratio\", {'dpi_ratio': mpl.ratio});\n", + " }\n", + " fig.send_message(\"refresh\", {});\n", + " }\n", + "\n", + " this.imageObj.onload = function() {\n", + " if (fig.image_mode == 'full') {\n", + " // Full images could contain transparency (where diff images\n", + " // almost always do), so we need to clear the canvas so that\n", + " // there is no ghosting.\n", + " fig.context.clearRect(0, 0, fig.canvas.width, fig.canvas.height);\n", + " }\n", + " fig.context.drawImage(fig.imageObj, 0, 0);\n", + " };\n", + "\n", + " this.imageObj.onunload = function() {\n", + " fig.ws.close();\n", + " }\n", + "\n", + " this.ws.onmessage = this._make_on_message_function(this);\n", + "\n", + " this.ondownload = ondownload;\n", + "}\n", + "\n", + "mpl.figure.prototype._init_header = function() {\n", + " var titlebar = $(\n", + " '
');\n", + " var titletext = $(\n", + " '
');\n", + " titlebar.append(titletext)\n", + " this.root.append(titlebar);\n", + " this.header = titletext[0];\n", + "}\n", + "\n", + "\n", + "\n", + "mpl.figure.prototype._canvas_extra_style = function(canvas_div) {\n", + "\n", + "}\n", + "\n", + "\n", + "mpl.figure.prototype._root_extra_style = function(canvas_div) {\n", + "\n", + "}\n", + "\n", + "mpl.figure.prototype._init_canvas = function() {\n", + " var fig = this;\n", + "\n", + " var canvas_div = $('
');\n", + "\n", + " canvas_div.attr('style', 'position: relative; clear: both; outline: 0');\n", + "\n", + " function canvas_keyboard_event(event) {\n", + " return fig.key_event(event, event['data']);\n", + " }\n", + "\n", + " canvas_div.keydown('key_press', canvas_keyboard_event);\n", + " canvas_div.keyup('key_release', canvas_keyboard_event);\n", + " this.canvas_div = canvas_div\n", + " this._canvas_extra_style(canvas_div)\n", + " this.root.append(canvas_div);\n", + "\n", + " var canvas = $('');\n", + " canvas.addClass('mpl-canvas');\n", + " canvas.attr('style', \"left: 0; top: 0; z-index: 0; outline: 0\")\n", + "\n", + " this.canvas = canvas[0];\n", + " this.context = canvas[0].getContext(\"2d\");\n", + "\n", + " var backingStore = this.context.backingStorePixelRatio ||\n", + "\tthis.context.webkitBackingStorePixelRatio ||\n", + "\tthis.context.mozBackingStorePixelRatio ||\n", + "\tthis.context.msBackingStorePixelRatio ||\n", + "\tthis.context.oBackingStorePixelRatio ||\n", + "\tthis.context.backingStorePixelRatio || 1;\n", + "\n", + " mpl.ratio = (window.devicePixelRatio || 1) / backingStore;\n", + "\n", + " var rubberband = $('');\n", + " rubberband.attr('style', \"position: absolute; left: 0; top: 0; z-index: 1;\")\n", + "\n", + " var pass_mouse_events = true;\n", + "\n", + " canvas_div.resizable({\n", + " start: function(event, ui) {\n", + " pass_mouse_events = false;\n", + " },\n", + " resize: function(event, ui) {\n", + " fig.request_resize(ui.size.width, ui.size.height);\n", + " },\n", + " stop: function(event, ui) {\n", + " pass_mouse_events = true;\n", + " fig.request_resize(ui.size.width, ui.size.height);\n", + " },\n", + " });\n", + "\n", + " function mouse_event_fn(event) {\n", + " if (pass_mouse_events)\n", + " return fig.mouse_event(event, event['data']);\n", + " }\n", + "\n", + " rubberband.mousedown('button_press', mouse_event_fn);\n", + " rubberband.mouseup('button_release', mouse_event_fn);\n", + " // Throttle sequential mouse events to 1 every 20ms.\n", + " rubberband.mousemove('motion_notify', mouse_event_fn);\n", + "\n", + " rubberband.mouseenter('figure_enter', mouse_event_fn);\n", + " rubberband.mouseleave('figure_leave', mouse_event_fn);\n", + "\n", + " canvas_div.on(\"wheel\", function (event) {\n", + " event = event.originalEvent;\n", + " event['data'] = 'scroll'\n", + " if (event.deltaY < 0) {\n", + " event.step = 1;\n", + " } else {\n", + " event.step = -1;\n", + " }\n", + " mouse_event_fn(event);\n", + " });\n", + "\n", + " canvas_div.append(canvas);\n", + " canvas_div.append(rubberband);\n", + "\n", + " this.rubberband = rubberband;\n", + " this.rubberband_canvas = rubberband[0];\n", + " this.rubberband_context = rubberband[0].getContext(\"2d\");\n", + " this.rubberband_context.strokeStyle = \"#000000\";\n", + "\n", + " this._resize_canvas = function(width, height) {\n", + " // Keep the size of the canvas, canvas container, and rubber band\n", + " // canvas in synch.\n", + " canvas_div.css('width', width)\n", + " canvas_div.css('height', height)\n", + "\n", + " canvas.attr('width', width * mpl.ratio);\n", + " canvas.attr('height', height * mpl.ratio);\n", + " canvas.attr('style', 'width: ' + width + 'px; height: ' + height + 'px;');\n", + "\n", + " rubberband.attr('width', width);\n", + " rubberband.attr('height', height);\n", + " }\n", + "\n", + " // Set the figure to an initial 600x600px, this will subsequently be updated\n", + " // upon first draw.\n", + " this._resize_canvas(600, 600);\n", + "\n", + " // Disable right mouse context menu.\n", + " 
$(this.rubberband_canvas).bind(\"contextmenu\",function(e){\n", + " return false;\n", + " });\n", + "\n", + " function set_focus () {\n", + " canvas.focus();\n", + " canvas_div.focus();\n", + " }\n", + "\n", + " window.setTimeout(set_focus, 100);\n", + "}\n", + "\n", + "mpl.figure.prototype._init_toolbar = function() {\n", + " var fig = this;\n", + "\n", + " var nav_element = $('
')\n", + " nav_element.attr('style', 'width: 100%');\n", + " this.root.append(nav_element);\n", + "\n", + " // Define a callback function for later on.\n", + " function toolbar_event(event) {\n", + " return fig.toolbar_button_onclick(event['data']);\n", + " }\n", + " function toolbar_mouse_event(event) {\n", + " return fig.toolbar_button_onmouseover(event['data']);\n", + " }\n", + "\n", + " for(var toolbar_ind in mpl.toolbar_items) {\n", + " var name = mpl.toolbar_items[toolbar_ind][0];\n", + " var tooltip = mpl.toolbar_items[toolbar_ind][1];\n", + " var image = mpl.toolbar_items[toolbar_ind][2];\n", + " var method_name = mpl.toolbar_items[toolbar_ind][3];\n", + "\n", + " if (!name) {\n", + " // put a spacer in here.\n", + " continue;\n", + " }\n", + " var button = $('');\n", + " button.click(method_name, toolbar_event);\n", + " button.mouseover(tooltip, toolbar_mouse_event);\n", + " nav_element.append(button);\n", + " }\n", + "\n", + " // Add the status bar.\n", + " var status_bar = $('');\n", + " nav_element.append(status_bar);\n", + " this.message = status_bar[0];\n", + "\n", + " // Add the close button to the window.\n", + " var buttongrp = $('
');\n", + " var button = $('');\n", + " button.click(function (evt) { fig.handle_close(fig, {}); } );\n", + " button.mouseover('Stop Interaction', toolbar_mouse_event);\n", + " buttongrp.append(button);\n", + " var titlebar = this.root.find($('.ui-dialog-titlebar'));\n", + " titlebar.prepend(buttongrp);\n", + "}\n", + "\n", + "mpl.figure.prototype._root_extra_style = function(el){\n", + " var fig = this\n", + " el.on(\"remove\", function(){\n", + "\tfig.close_ws(fig, {});\n", + " });\n", + "}\n", + "\n", + "mpl.figure.prototype._canvas_extra_style = function(el){\n", + " // this is important to make the div 'focusable\n", + " el.attr('tabindex', 0)\n", + " // reach out to IPython and tell the keyboard manager to turn it's self\n", + " // off when our div gets focus\n", + "\n", + " // location in version 3\n", + " if (IPython.notebook.keyboard_manager) {\n", + " IPython.notebook.keyboard_manager.register_events(el);\n", + " }\n", + " else {\n", + " // location in version 2\n", + " IPython.keyboard_manager.register_events(el);\n", + " }\n", + "\n", + "}\n", + "\n", + "mpl.figure.prototype._key_event_extra = function(event, name) {\n", + " var manager = IPython.notebook.keyboard_manager;\n", + " if (!manager)\n", + " manager = IPython.keyboard_manager;\n", + "\n", + " // Check for shift+enter\n", + " if (event.shiftKey && event.which == 13) {\n", + " this.canvas_div.blur();\n", + " event.shiftKey = false;\n", + " // Send a \"J\" for go to next cell\n", + " event.which = 74;\n", + " event.keyCode = 74;\n", + " manager.command_mode();\n", + " manager.handle_keydown(event);\n", + " }\n", + "}\n", + "\n", + "mpl.figure.prototype.handle_save = function(fig, msg) {\n", + " fig.ondownload(fig, null);\n", + "}\n", + "\n", + "\n", + "mpl.find_output_cell = function(html_output) {\n", + " // Return the cell and output element which can be found *uniquely* in the notebook.\n", + " // Note - this is a bit hacky, but it is done because the \"notebook_saving.Notebook\"\n", + " // IPython event is triggered only after the cells have been serialised, which for\n", + " // our purposes (turning an active figure into a static one), is too late.\n", + " var cells = IPython.notebook.get_cells();\n", + " var ncells = cells.length;\n", + " for (var i=0; i= 3 moved mimebundle to data attribute of output\n", + " data = data.data;\n", + " }\n", + " if (data['text/html'] == html_output) {\n", + " return [cell, data, j];\n", + " }\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "// Register the function which deals with the matplotlib target/channel.\n", + "// The kernel may be null if the page has been refreshed.\n", + "if (IPython.notebook.kernel != null) {\n", + " IPython.notebook.kernel.comm_manager.register_target('matplotlib', mpl.mpl_figure_comm);\n", + "}\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_pca_scatterplot(model, sample=300)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": 
"3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/Assignment_1_intro_word_vectors/exploring_word_vectors.ipynb b/Assignment_1_intro_word_vectors/exploring_word_vectors.ipynb new file mode 100644 index 0000000..caee3f0 --- /dev/null +++ b/Assignment_1_intro_word_vectors/exploring_word_vectors.ipynb @@ -0,0 +1,1456 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# CS224N Assignment 1: Exploring Word Vectors (25 Points)\n", + "\n", + "Welcome to CS224n! \n", + "\n", + "Before you start, make sure you read the README.txt in the same directory as this notebook and enter your SUID below. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Please Enter Your SUID Here: deepeye" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[nltk_data] Downloading package reuters to\n", + "[nltk_data] C:\\Users\\Chongjiu\\AppData\\Roaming\\nltk_data...\n", + "[nltk_data] Package reuters is already up-to-date!\n" + ] + } + ], + "source": [ + "# All Import Statements Defined Here\n", + "# Note: Do not add to this list.\n", + "# All the dependencies you need, can be installed by running .\n", + "# ----------------\n", + "\n", + "import sys\n", + "assert sys.version_info[0]==3\n", + "assert sys.version_info[1] >= 5\n", + "\n", + "from gensim.models import KeyedVectors\n", + "from gensim.test.utils import datapath\n", + "import pprint\n", + "import matplotlib.pyplot as plt\n", + "plt.rcParams['figure.figsize'] = [10, 5]\n", + "import nltk\n", + "nltk.download('reuters')\n", + "from nltk.corpus import reuters\n", + "import numpy as np\n", + "import random\n", + "import scipy as sp\n", + "from sklearn.decomposition import TruncatedSVD\n", + "from sklearn.decomposition import PCA\n", + "\n", + "START_TOKEN = ''\n", + "END_TOKEN = ''\n", + "\n", + "np.random.seed(0)\n", + "random.seed(0)\n", + "# ----------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Word Vectors\n", + "\n", + "Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from *co-occurrence matrices*, and those derived via *word2vec*. \n", + "\n", + "**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.\n", + "\n", + "**Note on Terminology:** The terms \"word vectors\" and \"word embeddings\" are often used interchangeably. The term \"embedding\" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding) states, \"*conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension*\"." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 1: Count-Based Word Vectors (10 points)\n", + "\n", + "Most word vector models start from the following idea:\n", + "\n", + "*You shall know a word by the company it keeps ([Firth, J. R. 
1957:11](https://en.wikipedia.org/wiki/John_Rupert_Firth))*\n", + "\n", + "Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many \"old school\" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, *co-occurrence matrices* (for more information, see [here](http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf) or [here](https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285))." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Co-Occurrence\n", + "\n", + "A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \\dots w_{i-1}$ and $w_{i+1} \\dots w_{i+n}$. We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window.\n", + "\n", + "**Example: Co-Occurrence with Fixed Window of n=1**:\n", + "\n", + "Document 1: \"all that glitters is not gold\"\n", + "\n", + "Document 2: \"all is well that ends well\"\n", + "\n", + "\n", + "| * | START | all | that | glitters | is | not | gold | well | ends | END |\n", + "|----------|-------|-----|------|----------|------|------|-------|------|------|-----|\n", + "| START | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", + "| all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |\n", + "| that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |\n", + "| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |\n", + "| is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |\n", + "| not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |\n", + "| gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |\n", + "| well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |\n", + "| ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |\n", + "| END | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |\n", + "\n", + "**Note:** In NLP, we often add START and END tokens to represent the beginning and end of sentences, paragraphs or documents. In thise case we imagine START and END tokens encapsulating each document, e.g., \"START All that glitters is not gold END\", and include these tokens in our co-occurrence counts.\n", + "\n", + "The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)* to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.\n", + "\n", + "This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. 
*doctor* and *hospital* will be closer than *doctor* and *dog*. \n", + "\n", + "**Notes:** If you can barely remember what an eigenvalue is, here's [a slow, friendly introduction to SVD](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf). If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures [7](https://web.stanford.edu/class/cs168/l/l7.pdf), [8](http://theory.stanford.edu/~tim/s15/l/l8.pdf), and [9](https://web.stanford.edu/class/cs168/l/l9.pdf) of CS168. These course notes provide a great high-level treatment of these general purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ — known as *[Truncated SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD)* — then there are reasonably scalable techniques to compute those iteratively." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Plotting Co-Occurrence Word Embeddings\n", + "\n", + "Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a `read_corpus` function below that pulls out only articles from the \"crude\" (i.e. news articles about oil, gas, etc.) category. The function also adds START and END tokens to each of the documents, and lowercases words. You do **not** have perform any other kind of pre-processing." + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [], + "source": [ + "def read_corpus(category=\"crude\"):\n", + " \"\"\" Read files from the specified Reuter's category.\n", + " Params:\n", + " category (string): category name\n", + " Return:\n", + " list of lists, with words from each of the processed files\n", + " \"\"\"\n", + " files = reuters.fileids(category)\n", + " return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's have a look what these documents are like…." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[['', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',\n", + " 'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',\n", + " 'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',\n", + " 'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',\n", + " 'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',\n", + " 'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',\n", + " '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',\n", + " 'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',\n", + " 'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',\n", + " 'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',\n", + " 'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',\n", + " 'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',\n", + " 'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',\n", + " 'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',\n", + " 'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', \"'\", 's',\n", + " 'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',\n", + " 'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',\n", + " 'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',\n", + " 'they', 'noted', '.', ''],\n", + " ['', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',\n", + " 'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',\n", + " 'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',\n", + " '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',',\n", + " 'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',', 'major', 'company',\n", + " 'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for', 'chemical', 'manufacturing',\n", + " 'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'businesses', 'has', 'prompted',\n", + " 'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'held', 'cain', 'chemical', 'inc',\n", + " ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acquisitions', 'of', 'petrochemical',\n", + " 'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil', 'inc', '&', 'lt', ';', 'ash',\n", + " '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'marketer', ',', 'are', 'also',\n", + " 'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'businesses', 'to', 'buy', '.', '\"',\n", + " 'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', 'golden', 'period', ',\"', 'said',\n", + " 'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemical', 'co', '&', 'lt', ';',\n", + " 'dow', '>,', 'adding', ',', '\"', 
'there', \"'\", 's', 'no', 'major', 'plant', 'capacity', 'being',\n", + " 'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'game', 'is', 'bringing', 'out',\n", + " 'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '.\"', 'analysts', 'say', 'the',\n", + " 'chemical', 'industry', \"'\", 's', 'biggest', 'customers', ',', 'automobile', 'manufacturers',\n", + " 'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paints', 'and', 'plastics', ',',\n", + " 'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.', 'u', '.', 's', '.',\n", + " 'petrochemical', 'plants', 'are', 'currently', 'operating', 'at', 'about', '90', 'pct',\n", + " 'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could', 'hike', 'product', 'prices',\n", + " 'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john', 'dosher', ',', 'managing',\n", + " 'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '.', 'demand', 'for', 'some',\n", + " 'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'margins', 'up', 'by', 'as',\n", + " 'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',', 'speaking', 'at', 'a',\n", + " 'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 'said', 'dow', 'would', 'easily',\n", + " 'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year', 'and', 'predicted', 'it',\n", + " 'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.', 'in', '1985', ',', 'when',\n", + " 'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'barrel', 'and', 'chemical',\n", + " 'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong', 'u', '.', 's', '.', 'dollar',\n", + " ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '\"', 'i', 'believe', 'the',\n", + " 'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'record', 'year', 'or', 'close',\n", + " 'to', 'it', ',\"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'samuel', 'heyman', 'estimated',\n", + " 'that', 'the', 'u', '.', 's', '.', 'chemical', 'industry', 'would', 'report', 'a', '20', 'pct',\n", + " 'gain', 'in', 'profits', 'during', '1987', '.', 'last', 'year', ',', 'the', 'domestic',\n", + " 'industry', 'earned', 'a', 'total', 'of', '13', 'billion', 'dlrs', ',', 'a', '54', 'pct', 'leap',\n", + " 'from', '1985', '.', 'the', 'turn', 'in', 'the', 'fortunes', 'of', 'the', 'once', '-', 'sickly',\n", + " 'chemical', 'industry', 'has', 'been', 'brought', 'about', 'by', 'a', 'combination', 'of', 'luck',\n", + " 'and', 'planning', ',', 'said', 'pace', \"'\", 's', 'john', 'dosher', '.', 'dosher', 'said', 'last',\n", + " 'year', \"'\", 's', 'fall', 'in', 'oil', 'prices', 'made', 'feedstocks', 'dramatically', 'cheaper',\n", + " 'and', 'at', 'the', 'same', 'time', 'the', 'american', 'dollar', 'was', 'weakening', 'against',\n", + " 'foreign', 'currencies', '.', 'that', 'helped', 'boost', 'u', '.', 's', '.', 'chemical',\n", + " 'exports', '.', 'also', 'helping', 'to', 'bring', 'supply', 'and', 'demand', 'into', 'balance',\n", + " 'has', 'been', 'the', 'gradual', 'market', 'absorption', 'of', 'the', 'extra', 'chemical',\n", + " 'manufacturing', 'capacity', 'created', 'by', 'middle', 'eastern', 'oil', 'producers', 'in',\n", + " 'the', 'early', '1980s', '.', 'finally', ',', 'virtually', 'all', 'major', 'u', '.', 's', '.',\n", + " 'chemical', 'manufacturers', 'have', 'embarked', 'on', 'an', 'extensive', 'corporate',\n", + " 'restructuring', 'program', 'to', 'mothball', 'inefficient', 'plants', ',', 'trim', 'the',\n", + " 
'payroll', 'and', 'eliminate', 'unrelated', 'businesses', '.', 'the', 'restructuring', 'touched',\n", + " 'off', 'a', 'flurry', 'of', 'friendly', 'and', 'hostile', 'takeover', 'attempts', '.', 'gaf', ',',\n", + " 'which', 'made', 'an', 'unsuccessful', 'attempt', 'in', '1985', 'to', 'acquire', 'union',\n", + " 'carbide', 'corp', '&', 'lt', ';', 'uk', '>,', 'recently', 'offered', 'three', 'billion', 'dlrs',\n", + " 'for', 'borg', 'warner', 'corp', '&', 'lt', ';', 'bor', '>,', 'a', 'chicago', 'manufacturer',\n", + " 'of', 'plastics', 'and', 'chemicals', '.', 'another', 'industry', 'powerhouse', ',', 'w', '.',\n", + " 'r', '.', 'grace', '&', 'lt', ';', 'gra', '>', 'has', 'divested', 'its', 'retailing', ',',\n", + " 'restaurant', 'and', 'fertilizer', 'businesses', 'to', 'raise', 'cash', 'for', 'chemical',\n", + " 'acquisitions', '.', 'but', 'some', 'experts', 'worry', 'that', 'the', 'chemical', 'industry',\n", + " 'may', 'be', 'headed', 'for', 'trouble', 'if', 'companies', 'continue', 'turning', 'their',\n", + " 'back', 'on', 'the', 'manufacturing', 'of', 'staple', 'petrochemical', 'commodities', ',', 'such',\n", + " 'as', 'ethylene', ',', 'in', 'favor', 'of', 'more', 'profitable', 'specialty', 'chemicals',\n", + " 'that', 'are', 'custom', '-', 'designed', 'for', 'a', 'small', 'group', 'of', 'buyers', '.', '\"',\n", + " 'companies', 'like', 'dupont', '&', 'lt', ';', 'dd', '>', 'and', 'monsanto', 'co', '&', 'lt', ';',\n", + " 'mtc', '>', 'spent', 'the', 'past', 'two', 'or', 'three', 'years', 'trying', 'to', 'get', 'out',\n", + " 'of', 'the', 'commodity', 'chemical', 'business', 'in', 'reaction', 'to', 'how', 'badly', 'the',\n", + " 'market', 'had', 'deteriorated', ',\"', 'dosher', 'said', '.', '\"', 'but', 'i', 'think', 'they',\n", + " 'will', 'eventually', 'kill', 'the', 'margins', 'on', 'the', 'profitable', 'chemicals', 'in',\n", + " 'the', 'niche', 'market', '.\"', 'some', 'top', 'chemical', 'executives', 'share', 'the',\n", + " 'concern', '.', '\"', 'the', 'challenge', 'for', 'our', 'industry', 'is', 'to', 'keep', 'from',\n", + " 'getting', 'carried', 'away', 'and', 'repeating', 'past', 'mistakes', ',\"', 'gaf', \"'\", 's',\n", + " 'heyman', 'cautioned', '.', '\"', 'the', 'shift', 'from', 'commodity', 'chemicals', 'may', 'be',\n", + " 'ill', '-', 'advised', '.', 'specialty', 'businesses', 'do', 'not', 'stay', 'special', 'long',\n", + " '.\"', 'houston', '-', 'based', 'cain', 'chemical', ',', 'created', 'this', 'month', 'by', 'the',\n", + " 'sterling', 'investment', 'banking', 'group', ',', 'believes', 'it', 'can', 'generate', '700',\n", + " 'mln', 'dlrs', 'in', 'annual', 'sales', 'by', 'bucking', 'the', 'industry', 'trend', '.',\n", + " 'chairman', 'gordon', 'cain', ',', 'who', 'previously', 'led', 'a', 'leveraged', 'buyout', 'of',\n", + " 'dupont', \"'\", 's', 'conoco', 'inc', \"'\", 's', 'chemical', 'business', ',', 'has', 'spent', '1',\n", + " '.', '1', 'billion', 'dlrs', 'since', 'january', 'to', 'buy', 'seven', 'petrochemical', 'plants',\n", + " 'along', 'the', 'texas', 'gulf', 'coast', '.', 'the', 'plants', 'produce', 'only', 'basic',\n", + " 'commodity', 'petrochemicals', 'that', 'are', 'the', 'building', 'blocks', 'of', 'specialty',\n", + " 'products', '.', '\"', 'this', 'kind', 'of', 'commodity', 'chemical', 'business', 'will', 'never',\n", + " 'be', 'a', 'glamorous', ',', 'high', '-', 'margin', 'business', ',\"', 'cain', 'said', ',',\n", + " 'adding', 'that', 'demand', 'is', 'expected', 'to', 'grow', 'by', 'about', 'three', 'pct',\n", + " 'annually', '.', 'garo', 'armen', ',', 
'an', 'analyst', 'with', 'dean', 'witter', 'reynolds', ',',\n", + " 'said', 'chemical', 'makers', 'have', 'also', 'benefitted', 'by', 'increasing', 'demand', 'for',\n", + " 'plastics', 'as', 'prices', 'become', 'more', 'competitive', 'with', 'aluminum', ',', 'wood',\n", + " 'and', 'steel', 'products', '.', 'armen', 'estimated', 'the', 'upturn', 'in', 'the', 'chemical',\n", + " 'business', 'could', 'last', 'as', 'long', 'as', 'four', 'or', 'five', 'years', ',', 'provided',\n", + " 'the', 'u', '.', 's', '.', 'economy', 'continues', 'its', 'modest', 'rate', 'of', 'growth', '.',\n", + " ''],\n", + " ['', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'dispute', 'turkey', 'said',\n", + " 'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'rights', 'on', 'the',\n", + " 'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should', 'be', 'solved', 'through',\n", + " 'negotiations', '.', 'a', 'foreign', 'ministry', 'statement', 'said', 'the', 'latest', 'crisis',\n", + " 'between', 'the', 'two', 'nato', 'members', 'stemmed', 'from', 'the', 'continental', 'shelf',\n", + " 'dispute', 'and', 'an', 'agreement', 'on', 'this', 'issue', 'would', 'effect', 'the', 'security',\n", + " ',', 'economy', 'and', 'other', 'rights', 'of', 'both', 'countries', '.', '\"', 'as', 'the',\n", + " 'issue', 'is', 'basicly', 'political', ',', 'a', 'solution', 'can', 'only', 'be', 'found', 'by',\n", + " 'bilateral', 'negotiations', ',\"', 'the', 'statement', 'said', '.', 'greece', 'has', 'repeatedly',\n", + " 'said', 'the', 'issue', 'was', 'legal', 'and', 'could', 'be', 'solved', 'at', 'the',\n", + " 'international', 'court', 'of', 'justice', '.', 'the', 'two', 'countries', 'approached', 'armed',\n", + " 'confrontation', 'last', 'month', 'after', 'greece', 'announced', 'it', 'planned', 'oil',\n", + " 'exploration', 'work', 'in', 'the', 'aegean', 'and', 'turkey', 'said', 'it', 'would', 'also',\n", + " 'search', 'for', 'oil', '.', 'a', 'face', '-', 'off', 'was', 'averted', 'when', 'turkey',\n", + " 'confined', 'its', 'research', 'to', 'territorrial', 'waters', '.', '\"', 'the', 'latest',\n", + " 'crises', 'created', 'an', 'historic', 'opportunity', 'to', 'solve', 'the', 'disputes', 'between',\n", + " 'the', 'two', 'countries', ',\"', 'the', 'foreign', 'ministry', 'statement', 'said', '.', 'turkey',\n", + " \"'\", 's', 'ambassador', 'in', 'athens', ',', 'nazmi', 'akiman', ',', 'was', 'due', 'to', 'meet',\n", + " 'prime', 'minister', 'andreas', 'papandreou', 'today', 'for', 'the', 'greek', 'reply', 'to', 'a',\n", + " 'message', 'sent', 'last', 'week', 'by', 'turkish', 'prime', 'minister', 'turgut', 'ozal', '.',\n", + " 'the', 'contents', 'of', 'the', 'message', 'were', 'not', 'disclosed', '.', '']]\n" + ] + } + ], + "source": [ + "reuters_corpus = read_corpus()\n", + "pprint.pprint(reuters_corpus[:3], compact=True, width=100)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 1.1: Implement `distinct_words` [code] (2 points)\n", + "\n", + "Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with `for` loops, but it's more efficient to do it with Python list comprehensions. In particular, [this](https://coderwall.com/p/rcmaea/flatten-a-list-of-lists-in-one-line-in-python) may be useful to flatten a list of lists. 
If you're not familiar with Python list comprehensions in general, here's [more information](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).\n", + "\n", + "You may find it useful to use [Python sets](https://www.w3schools.com/python/python_sets.asp) to remove duplicate words." + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [], + "source": [ + "def distinct_words(corpus):\n", + " \"\"\" Determine a list of distinct words for the corpus.\n", + " Params:\n", + " corpus (list of list of strings): corpus of documents\n", + " Return:\n", + " corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)\n", + " num_corpus_words (integer): number of distinct words across the corpus\n", + " \"\"\"\n", + " corpus_words = []\n", + " num_corpus_words = -1\n", + " \n", + " # ------------------\n", + " # Write your implementation here.\n", + " corpus_words = [word for document in corpus for word in document ] \n", + " corpus_words = set(corpus_words)\n", + " corpus_words = sorted(list(corpus_words))\n", + " num_corpus_words = len(corpus_words)\n", + " # ------------------\n", + "\n", + " return corpus_words, num_corpus_words" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--------------------------------------------------------------------------------\n", + "Passed All Tests!\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "# ---------------------\n", + "# Run this sanity check\n", + "# Note that this not an exhaustive check for correctness.\n", + "# ---------------------\n", + "\n", + "# Define toy corpus\n", + "test_corpus = [\"START All that glitters isn't gold END\".split(\" \"), \"START All's well that ends well END\".split(\" \")]\n", + "test_corpus_words, num_corpus_words = distinct_words(test_corpus)\n", + "\n", + "# Correct answers\n", + "ans_test_corpus_words = sorted(list(set([\"START\", \"All\", \"ends\", \"that\", \"gold\", \"All's\", \"glitters\", \"isn't\", \"well\", \"END\"])))\n", + "ans_num_corpus_words = len(ans_test_corpus_words)\n", + "\n", + "# Test correct number of words\n", + "assert(num_corpus_words == ans_num_corpus_words), \"Incorrect number of distinct words. Correct: {}. Yours: {}\".format(ans_num_corpus_words, num_corpus_words)\n", + "\n", + "# Test correct words\n", + "assert (test_corpus_words == ans_test_corpus_words), \"Incorrect corpus_words.\\nCorrect: {}\\nYours: {}\".format(str(ans_test_corpus_words), str(test_corpus_words))\n", + "\n", + "# Print Success\n", + "print (\"-\" * 80)\n", + "print(\"Passed All Tests!\")\n", + "print (\"-\" * 80)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 1.2: Implement `compute_co_occurrence_matrix` [code] (3 points)\n", + "\n", + "Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use `numpy (np)` to represent vectors, matrices, and tensors. 
If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n [Python NumPy tutorial](http://cs231n.github.io/python-numpy-tutorial/).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [], + "source": [ + "def compute_co_occurrence_matrix(corpus, window_size=4):\n", + " \"\"\" Compute co-occurrence matrix for the given corpus and window_size (default of 4).\n", + " \n", + " Note: Each word in a document should be at the center of a window. Words near edges will have a smaller\n", + " number of co-occurring words.\n", + " \n", + " For example, if we take the document \"START All that glitters is not gold END\" with window size of 4,\n", + " \"All\" will co-occur with \"START\", \"that\", \"glitters\", \"is\", and \"not\".\n", + " \n", + " Params:\n", + " corpus (list of list of strings): corpus of documents\n", + " window_size (int): size of context window\n", + " Return:\n", + " M (numpy matrix of shape (number of corpus words, number of number of corpus words)): \n", + " Co-occurence matrix of word counts. \n", + " The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.\n", + " word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.\n", + " \"\"\"\n", + " words, num_words = distinct_words(corpus)\n", + " M = None\n", + " word2Ind = {}\n", + " \n", + " # ------------------\n", + " # Write your implementation here.\n", + "\n", + " indexs = [x for x in range(0,len(words))]\n", + " word2Ind = dict(zip(words,indexs))\n", + " M = np.zeros((num_words,num_words))\n", + " for document in corpus:\n", + " len_doc = len(document)\n", + " for index in range(0,len_doc):\n", + " center_index = word2Ind[document[index]]\n", + " for i in range(index-window_size,index+window_size+1):\n", + " if i>=0 and i" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "# ---------------------\n", + "# Run this sanity check\n", + "# Note that this not an exhaustive check for correctness.\n", + "# The plot produced should look like the \"test solution plot\" depicted below. \n", + "# ---------------------\n", + "\n", + "print (\"-\" * 80)\n", + "print (\"Outputted Plot:\")\n", + "\n", + "M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])\n", + "word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}\n", + "words = ['test1', 'test2', 'test3', 'test4', 'test5']\n", + "plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)\n", + "\n", + "print (\"-\" * 80)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Test Plot Solution**\n", + "
\n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)\n", + "\n", + "Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed window of 5, over the Reuters \"crude\" corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U\\*S, so we normalize the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). **Note**: The line of code below that does the normalizing uses the NumPy concept of *broadcasting*. If you don't know about broadcasting, check out\n", + "[Computation on Arrays: Broadcasting by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html).\n", + "\n", + "Run the below cell to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? **Note:** \"bpd\" stands for \"barrels per day\" and is a commonly used abbreviation in crude oil topic articles." + ] + }, + { + "cell_type": "code", + "execution_count": 193, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running Truncated SVD over 8185 words...\n", + "Done.\n" + ] + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnEAAAEyCAYAAACVqYZnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3XmcFdWd///Xh02UEZstqCCiQcElhmhLhMGIiMpIAtERNfFnwAGMycQhHRcwfL8McTCayUS+HZMxQRLRIcYlMcJPR4yiTcIXktjMYHAHDYMsImFRGTNsfb5/3NtwabvZbjdN0a/n43Eft5ZTVae6aH33OaeqIqWEJEmSsqVZY1dAkiRJ+84QJ0mSlEGGOEmSpAwyxEmSJGWQIU6SJCmDDHGSJEkZZIiTJEnKIEOcJElSBhniJEmSMqhFY1dgf3Ts2DF17969sashSZK0RwsXLvxzSqlTfe83kyGue/fuVFZWNnY1JEmS9igi/qsh9mt3qiRJUgYZ4iRJkjLIECdJkpRBhjhJkqQMMsRJkiRlkCFOkiQpgwxxkiTpkDB9+nTef//9fdqmR48eDVSbhmeIkyRJh4S6Qtz27dsboTYNL5MP+5UkSU3DsmXLuOyyyzjppJN48803ueaaaxg5ciRjxoxh3bp1pJSYOnUqy5cvZ9GiRQwfPpzS0lJuvPFGhg8fTq9evWjZsiV33HEHI0eO5MMPP6RNmzbcf//9dOq08yUKW7du5Stf+QpvvvkmW7du5a677qJPnz4MGDCAGTNm0LVrVyZPnkzXrl0ZOXIkPXr04NJLL2XevHmcddZZHHPMMTz99NO0a9eOxx9/nIho8J+NLXGSJKnhpbT7+d14++23mTZtGgsWLOC+++6jrKyMyy67jDlz5jBlyhTGjx/PwIED6d27N48++ih33303kAuAP/zhD/npT3/KHXfcwRe+8AXmzp3LVVddxR133LHLMX7yk5/Qo0cPnn/+eX75y19SVla22zpt27aNa665hgULFjBnzhxOOeUUfvOb3xARLFq0aK/PrRi2xEmSpIY1aRJs3AhTpkBELsCVlUFJSW7dHvTq1YsjjzwSgNNPP53Vq1dTXl7Oj370IwBatKg9zpx++um0bdsWgNdff52vfe1rAPTr14+HHnpol7KLFy9m/vz5zJ49G4D33nsPYJcWtVQQPFu0aMEZZ5wBQJcuXfjUpz4FQNeuXVm/fv0ez6k+GOIkSVLDSSkX4MrLc/NTpuQCXHk5jB2bW7+HrsfXXnuNTZs20bp1a1566SXOPPNMrrvuOi699FIAtmzZAkCrVq3Ytm3bju2aN2++Y7pnz57Mnz+fHj16MH/+fHr27LnLMU477TR69OixowWuep/t27dnxYoVdO3alYULF3LcccfVWse6wl5DMsRJkqSGE5ELbpALbtVhbuzYnS1ze9C9e3fGjBnDkiVLGDFiBH/3d3/H9ddfz913301KiSFDhnDTTTdx2WWXMWrUKPr168eoUaN22cf48eMZMWIE06ZN44gjjuCBBx7YZf2YMWO44YYbOP/88wEoLS3lu9/9Lv/wD//A6NGjOfnkkznssMOK/3nUozhQabE+lZaWpsrKysauhiRJ2lspQbOCofhVVXsV4JYtW8bo0aN59tlnG7ByDSsiFqaUSut7v97YIEmSGlb1GLhCZWX7dHODPsoQJ0mSGk51gKseA1dVlfsuL9+rINe9e/dMt8I1JMfESZKkhhORuwu1cAxc9Ri5kpK96lJV7RwTJ0mSGl7Nu1D34q7UQ4Vj4iRJUnbVDGxNJMA1JEOcJElSBhniJEmSMsgQJ0mSDmnvvPMON954416VHT16NBUVFfu0/8cff5zly5fvR82KY4iTJEmHtKOPPprvfe97Dbb/ukLc9u3bG+yY4CNGJEnSIa76rQ/9+/dnyZIlfPDBByxfvpyHHnqIXr168eijj3L77bdzwgknsHHjxl22qX5GXY8ePVi6dCkVFRXccssttGnThu7du3PzzTcze/ZsXnzxRXr06MGjjz7K8ccfz5AhQ1i+fDlt2rQBOB
[... base64 PNG data truncated: 2-D scatter plot of the normalized co-occurrence embeddings produced by the cell below ...]\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# -----------------------------\n", + "# Run This Cell to Produce Your Plot\n", + "# ------------------------------\n", + "reuters_corpus = read_corpus()\n", + "M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)\n", + "M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)\n", + "# Rescale (normalize) the rows to make them each of unit-length\n", + "M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)\n", + "\n", + "M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting\n", + "\n", + "\n", + "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']\n", + "plot_embeddings(M_normalized, word2Ind_co_occurrence, words)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer\n", + "In the 2 dimensional representation, the country names form an immediate cluster.
The other clusters observed are oil/energy and petroleum/industry, which make intuitive sense.
\n", + "Words like barrels and bpd should have been in closer proximity as they are corelated but you would use one or the other in a sentence/context since they express the same meaning.
\n", + "\n", + "国家名称聚类\n", + "\n", + "石油能源行业聚类" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 2: Prediction-Based Word Vectors (15 points)\n", + "\n", + "As discussed in class, more recently prediction-based word vectors have come into fashion, e.g. word2vec. Here, we shall explore the embeddings produced by word2vec. Please revisit the class notes and lecture slides for more details on the word2vec algorithm. If you're feeling adventurous, challenge yourself and try reading the [original paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).\n", + "\n", + "First make sure that you have downloaded the word2vec embeddings from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit\n", + "\n", + "Then run the following cells to load the word2vec vectors into memory. **Note**: This might take several minutes." + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [], + "source": [ + "# Fill this variable with the path to your downloaded and unzipped embeddings (`GoogleNews-vectors-negative300.bin` file).\n", + "#\n", + "# For Windows users place the `GoogleNews-vectors-negative300.bin` file in your conda environment's installation of gensim:\n", + "# `envs/{conda_env_name}/lib/site-packages/gensim/test/test_data`\n", + "# \n", + "# For Mac/Linux users, you can place the `GoogleNews-vectors-negative300.bin` file anywhere on your machine.\n", + "# \n", + "\n", + "embeddings_fp = \"GoogleNews-vectors-negative300.bin.gz\"" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [], + "source": [ + "def load_word2vec(embeddings_fp=embeddings_fp):\n", + " \"\"\" Load Word2Vec Vectors\n", + " Param:\n", + " embeddings_fp (string) - path to .bin file of pretrained word vectors\n", + " Return:\n", + " wv_from_bin: All 3 million embeddings, each lengh 300\n", + " This is the KeyedVectors format: https://radimrehurek.com/gensim/models/deprecated/keyedvectors.html\n", + " \"\"\"\n", + " embed_size = 300\n", + " print(\"Loading 3 million word vectors from file...\")\n", + " wv_from_bin = KeyedVectors.load_word2vec_format(datapath(embeddings_fp), binary=True)\n", + " vocab = list(wv_from_bin.vocab.keys())\n", + " print(\"Loaded vocab size %i\" % len(vocab))\n", + " return wv_from_bin" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading 3 million word vectors from file...\n", + "Loaded vocab size 3000000\n" + ] + } + ], + "source": [ + "# -----------------------------------\n", + "# Run Cell to Load Word Vectors\n", + "# Note: This may take several minutes\n", + "# -----------------------------------\n", + "wv_from_bin = load_word2vec()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Reducing dimensionality of Word2Vec Word Embeddings\n", + "Let's directly compare the word2vec embeddings to those of the co-occurrence matrix. Run the following cells to:\n", + "\n", + "1. Put the 3 million word2vec vectors into a matrix M\n", + "2. Run reduce_to_k_dim (your Truncated SVD function) to reduce the vectors from 300-dimensional to 2-dimensional." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 88, + "metadata": {}, + "outputs": [], + "source": [ + "def get_matrix_of_vectors(wv_from_bin):\n", + " \"\"\" Put the word2vec vectors into a matrix M.\n", + " Param:\n", + " wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file\n", + " Return:\n", + " M: numpy matrix shape (num words, 300) containing the vectors\n", + " word2Ind: dictionary mapping each word to its row number in M\n", + " \"\"\"\n", + " words = list(wv_from_bin.vocab.keys())\n", + " print(\"Putting %i words into word2Ind and matrix M...\" % len(words))\n", + " word2Ind = {}\n", + " M = []\n", + " curInd = 0\n", + " for w in words:\n", + " try:\n", + " M.append(wv_from_bin.word_vec(w))\n", + " word2Ind[w] = curInd\n", + " curInd += 1\n", + " except KeyError:\n", + " continue\n", + " M = np.stack(M)\n", + " print(\"Done.\")\n", + " return M, word2Ind" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Putting 3000000 words into word2Ind and matrix M...\n", + "Done.\n", + "Running Truncated SVD over 3000000 words...\n", + "Done.\n" + ] + } + ], + "source": [ + "# -----------------------------------------------------------------\n", + "# Run Cell to Reduce 300-Dimensinal Word Embeddings to k Dimensions\n", + "# Note: This may take several minutes\n", + "# -----------------------------------------------------------------\n", + "M, word2Ind = get_matrix_of_vectors(wv_from_bin)\n", + "M_reduced = reduce_to_k_dim(M, k=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.1: Word2Vec Plot Analysis [written] (4 points)\n", + "\n", + "Run the cell below to plot the 2D word2vec embeddings for `['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']`.\n", + "\n", + "What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? How is the plot different from the one generated earlier from the co-occurrence matrix?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 90, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAl8AAAEyCAYAAADEPbUEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3X+YlVW9///nm0FETR0F9CSIqIM/AA11QuFMOhAaasIXE5FjJlwOWn40Ik8JFx0OIYUey3FKPxrhj84xMk1TUoM8yJB8MWP46pVJoUhIoIEpZh47wDDr+8f8cDOADs6ee37wfFwX1+x777Xvte51DXte973WXneklJAkSVI2OrV2AyRJkvYmhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOdW7sBu9O9e/fUp0+f1m6GJEnSh1qxYsVfU0o9mlK2zYavPn36UFVV1drNkCRJ+lAR8WpTyzrsKEmSlCHDlyRJUoYMX5IkSRkyfEmSJGXI8CU10/jx41m6dGlrN0OS1E4YviRJkjJk+FK7cd111/Hoo48C8I9//IOBAweyZMkSzjrrLEpLS/niF79ISom1a9dy2mmn8fnPf55TTz2VW2+9FYC//e1vXHzxxXz6059m2LBhrF69mg0bNlBaWkppaSn9+/fnc5/7HGvXrmX48OEN9RYVFQGwbds2ysrKGDp0KCUlJfz2t7/dqY2f+cxnKC0tZdCgQTzzzDMZ9Iokqb0xfKnd+MIXvsB//ud/AvDoo49ywQUX8JWvfIX58+dTWVnJfvvtx+OPPw7A66+/zpw5c1i2bBkVFRUAzJ49mwsvvJBFixZRXl7OlClT6NmzJ5WVlTz66KN069aNmTNn7rb+u+66i6KiIhYvXsxDDz3E5MmTdyrz8MMPU1lZyY9+9COmTZvWAr0gSWrv2uwiqxIpQUTD5idOPpn169ezefNm7rvvPv793/+d2267jVGjRgHw7rvvcvzxxzNgwABOPPFE9t9/fwAKCgoAeOGFF1iyZAl33nknAJ071/76b9myhbFjxzJr1iz69+/Pq6/uuE5eSqnh/cuWLWPBggVA7ZW0XP/4xz+YNGkSq1atoqCggA0bNuS7RyRJHYDhS23TjBnw9ttQXl4bwFKCyZMZW1hIRUUF7777LsXFxRxzzDE89thjfOxjHwNqhwY3bNhA5IS2ev3792fw4MGMHj0agK1bt5JSYsKECZSVlXHmmWcCcMghh/Daa6+RUmLjxo0NIap///4UFRU1XPHaunXrDvtfsGABBQUFPP3006xcuZKRI0e2VO9Iktoxw5fanpRqg1fdcCHl5TB5MlRUcGlZGUfdeCMVFRVEBLfccgsjR44kpUSnTp0oLy/noIMO2uVup02bxhe/+EW+//3vk1Li/PPP5/TTT+fxxx/ntdde47bbbqOkpIRZs2YxYsQIBg8ezKBBgzj88MMBmDhxItdeey1Dhw4FoLi4mJtvvrlh/4MHD2b27NkMHz6cf/7nf27ZPpIktVtRP6TSrJ1EjAAqgAJgbkrpxkavjwduBurHYW5LKc39oH0WFxcn7+24F6u70tUQwAAmTXr/SpgkSW1IRKxIKRU3qWxzw1dEFAAvAWcD64HlwLiU0sqcMuOB4pTSNU3dr+FLpASdcr4TUlNj8JIktUl7Er7y8W3HQcDqlNKalNJW4H5gVB72q71Z/ZWvXJMn1z4vSVI7lo/w1RP4c872+rrnGvtcRPwuIn4WEUfmoV51VLlDjpMm1V7xmjSpdtsAJklq57KacP8L4CcppS0RcRXwI2BY40IRcSVwJUDv3r0zapranAgoLNxxjld5ee1rhYUOPUqS2rV8zPkaDMxIKX2mbnsqQEpp9m7KFwBvpZQO/qD9OudLjdf52mlbkqQ2Ius5X8uBvhFxdER0AS4B5jdq0MdzNkcCf8hDveroGgctg5ckqQNo9rBjSqk6Iq4BFlK71MTdKaUXI2ImUJVSmg98OSJGAtXAW8D45tYrSZLUHuVlna+W4LCjJElqL7IedpQkSVITGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDOUlfEXEiIhYFRGrI2LKB5T7XESkiCjOR72SJEntTbPDV0QUALcD5wL9gHER0W8X5Q4EJgHPNrdOSZKk9iofV74GAatTSmtSSluB+4FRuyh3A3AT8L95qFOSJKldykf46gn8OWd7fd1zDSLiVODIlNLjeahPkiSp3WrxCfcR0Qm4BbiuCWWvjIiqiKh64403WrppkiRJmctH+NoAHJmz3avuuXoHAgOAyohYC5wBzN/VpPuU0pyUUnFKqbhHjx55aJokSVLbko/wtRzoGxFHR0QX4BJgfv2LKaW/pZS6p5T6pJT6AL8BRqaUqvJQtyRJUrvS7PCVUqoGrgEWAn8AHkgpvRgRMyNiZHP3L0mS1JF0zsdOUkpPAE80em76bsqW5qNOSZKk9sgV7iVJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPjK0Pr16yktLW3tZkiSpFZk+Gqjtm/f3tpNkCRJLcDw1cjUqVM566yzGDx4MI899hjr1q1jxIgRnHXWWQwfPpyamhrGjx/P0qVLAbjvvvuYMWMGANdffz1Dhw7l1FNPZc6cOQC8++67nH/++QwfPpxvf/vbDfW89NJLlJaWctZZZzF27Fj+8Y9/AHDUUUdx9dVXM2rUqGwPXJIkZaJzazegLVmwYAGbN29myZIlvPfeewwePJjjjjuOyZMn85nPfIaamho6ddp9Xp0+fToHHHAAW7Zs4aSTTmLChAn88Ic/pKSkhKlTp/LjH/+YlStXAvD1r3+dmTNncuaZZzJz5kx++MMf8uUvf5nXX3+dKVOm0Lt376wOW5IkZcgrXzleeOEFlixZQmlpKeeddx5btmxh5cqVDBs2DKAheEVEw3tSSg2P77jjDkpKSjjnnHPYtGkTmzZt4qWXXmLQoEEAnH766Q1lX3rpJYYMGQLAkCFD+OMf/whAz549DV6SJHVge3f4yglOAP379eOcc86hsrKSyspKfve739G/f38qKysBqKmpAeD
QQw9l/fr1AKxYsQKAzZs3c88997BkyRIWLlzIwQcfTEqJvn37UlVVBcDy5csb6jruuONYtmwZAMuWLeP4448HoKCgoOWOV5Iktbq9d9hxxgx4+20oL4cISInznnySZc89R2lpKRFBr169+M53vsPEiROZNWsW++yzD7/61a8oKytj3LhxzJs3j+7du1NYWEhhYSH9+vWjpKSEE088kW7dugEwceJELr74Yp588kkGDBjQUP2NN97IVVddRUqJww47jP/6r/9qpY6QJElZitTo6k9bUVxcnOqvGOVdSjB5MlRUwKRJtQGs8XbO0KIkSdIHiYgVKaXippTdO698RdQGLKgNXBUVtY8NXpIkqYXtnVe+6qUEud9erKkxeEmSpD22J1e+9t4J9/VDj7kmT95pEr4kSVI+7Z3hq/Gcr5qa2p8VFQYwSZLUovbeOV+FhTvO8aqfA1ZY6NCjJElqMXmZ8xURI4AKoACYm1K6sdHrXwT+D7AdeBe4MqW08oP2mdmcr9yg1XhbkiSpCTKd8xURBcDtwLlAP2BcRPRrVGxeSumklNJA4D+AW5pbb140DloGL0mS1MLyMedrELA6pbQmpbQVuB/Y4a7QKaV3cjYPAJxUJUmS9kr5mPPVE/hzzvZ64PTGhSLi/wBfBboAw3a1o4i4ErgS8P6GkiSpQ8rs244ppdtTSscC1wPf2E2ZOSml4pRScY8ePbJqmiRJUmbyEb42AEfmbPeqe2537gf+nzzUK0mS1O7kI3wtB/pGxNER0QW4BJifWyAi+uZsng+8nId6JUmS2p1mz/lKKVVHxDXAQmqXmrg7pfRiRMwEqlJK84FrImI4sA3YDFze3HolSZLao7wssppSegJ4otFz03MeT8pHPZIkSe3d3nl7IUmSpFZi+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JI6uO9973sf+b333nsv77zzTh5bI0kyfEkdnOFLktoWw5fUDqWUuOqqqygpKWHIkCH89re/pbS0lPXr1wMwa9Ys7r33XubNm8eGDRsoLS3lW9/6FpWVlQwdOpTRo0czcOBAHnzwQQDGjx/P0qVLAbjvvvuYMWMGTz31FM8//zxjxozh2muvbbVjlaSOpnNrN0DSnnv00UfZtm0bS5cuZc2aNVxyySXsv//+O5X7l3/5F6ZPn05lZSUAlZWVvPHGGzz55JO89957FBcX87nPfW6XdQwbNoyBAwdy33330atXr5Y8HEnaq3jlS2rrUtppe9WqVQwZMgSAY445hs2bNxMROUUavSfHKaecQufOnTnooIM47LDDeOONN5r8XklS8xm+pLZsxgyYPPn9AJYSTJ7M8c89x7JlywBYs2YNhYWFHHrooQ3DjitWrGjYRefOnampqWnYfv7556murubvf/87GzdupEePHrt9b5cuXaiurm7hg5SkvYvDjlJblRK8/TZUVNRul5fXBrGKCkZ++cs8/j//Q0lJCdu3b+f73/8+W7ZsoaysjOOOO4599923YTcXXXQR559/Pueeey4nn3wyRxxxBGPGjOFPf/oTs2bNolOnTpSVlTFu3DjmzZtH9+7dKSwsBODCCy/kiiuuYMiQIdxwww2t0QuS1OFEWx1iKC4uTlVVVa3dDKl11V3paghgAJMm1QaxnKHCpqqsrOS+++5j7ty5eWykJCkiVqSUiptS1mFHqS2LqA1auT5i8JIktQ2GL6ktq7/ylSt3DtgeKi0t9aqXJLUyw5fUVuUOOU6aBDU1tT8rKpoVwCRJrcsJ91JbFQGFhTvO8aofgiwsdOhRktopJ9xLbV1KOwatxtuSpFbnhHupI2kctAxektSuGb4kSZIyZPiSJEnKUF7CV0SMiIhVEbE6Iqbs4vWvRsTKiPhdRCyKiKPyUa8kSVJ70+zwFREFwO3AuUA/YFxE9GtU7DmgOKV0MvAz4D+aW68kSVJ7lI8rX4OA1SmlNSmlrcD9wKjcAimlxSml9+o2fwP0ykO9kiRJ7U4+wldP4M852+vrntudK4Bf7uqFiLgyIqoiouqNN97IQ9MkSZLalkwn3EfE54Fi4OZdvZ5SmpNSKk4pFffo0SPLpkmSJGUiHyvcbwCOzNnuVffcDiJiODANOCultCUP9UqSJLU7+bjytRzoGxFHR0QX4BJgfm6BiDgF+AEwMqW0KQ91SpIktUvNDl8ppWrgGmAh8AfggZTSixExMyJG1hW7GfgY8GBEPB8R83ezO0mSpA4tLzfWTik9ATzR6LnpOY+H56MeSZKk9s4V7iVJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIyZPiSJEnKkOFLkiQpQ4YvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JIkScqQ4UuSJClDhi9JkqQMGb4kSZIylJfwFREjImJVRKyOiCm7eP3MiPj/IqI6Ii7KR52SJEntUbPDV0QUALcD5wL9gHER0a9RsXXAeGBec+uTJElqzzrnYR+DgNUppTUAEXE/MApYWV8gpbS27rWaPNQnSZLUbuVj2LEn8Oec7fV1z+2xiLgyIqoiouqNN97IQ9MkSZLaljY14T6lNCelVJxSKu7Ro0drN0eSJCnv8hG+NgBH5mz3qntOkiRJjeQjfC0H+kbE0RHRBbgEmJ+H/UqSJHU4zQ5fKaVq4BpgIfAH4IGU0osRMTMiRgJExCcjYj0wBvhBRLzY3HolSZLao3x825GU0hPAE42em57zeDm1w5GSJEl7tTY14V6SJKmjM3xJkiRlyPAlSZKUIcOXJElShgxfkiRJGTJ8SZIkZcjwJUmSlCHDlyRJUoYMX5IkSRkyfEmSJGXI8CVJe2Dt2rUMHz68Rfb9/PPPc/PNNwPwyCOPsG7duhapR1Lrysu9HSVJzTdw4EAGDhwI1Iav7t2707t371ZulaR888qXJH1Et912G1/60pc4+uijG54bPnw4a9eu5Zvf/CY///nPSSlx2GGH8ctf/pLt27dTXFwMwPXXX8/QoUM59dRTmTNnDgCVlZWUlZWxcuVKFixYwLXXXsuYMWNa5dgktRyvfEnSRzB16lS6du3KHXfcQVFR0U6vDxs2jAceeIBjjjmGwYMH89RTT3HooYdy2mmnATB9+nQOOOAAtmzZwkknncSECRMa3tuvXz9GjBhBWVkZJSUlmR2TpGwYviRpd1KCiB23gRdffJHNmzfzzDPP7OIttWXOOOMMrrvuOo499liuueYaKioqWLx4McOGDQPgjjvu4JFHHqGgoIBNmzaxadOmlj8eSW2Cw46StCszZsDkyQ2Bi5Rqt2+9lf79+zN16lQuvvhitmzZQk1NDVu2bOG9997jD3/4AwD77LMP3bp146GHHqKkpIRu3brx8MMPM3ToUDZv3s
w999zDkiVLWLhwIQcffHBDaKvXpUsXqqurMz5oSVnwypckNZYSvP02VFTUbpeX1wavigqoGx686KKL6NKlCxdddBFXXHEFZ5xxBgMHDqRXr14Nuxk2bBiPPfYY++23H6WlpaxYsYLDDjuMlBL9+vWjpKSEE088kW7duu3UhM9+9rNMnz6dE088kR/84AeZHLakbETjs622ori4OFVVVbV2MyTtreqvdNUHMIBJk2qDWO5QpCQBEbEipVTcpLKGL0najZSgU87sjJoag5ekXdqT8OWcL0nalforX7ly54BJ0kdk+JKkxnKHHCdNqr3iNWlS7bYBTFIzOeFekhqLgMLCHed4lZfXvlZY6NCjpGZxzpck7c6u1vkyeEnaBed8SVI+NA5aBi9JeWD4kiRJypDhSx/qL3/5C9ddd12TypaVlVFZWblH+3/kkUdYt27dR2iZJEntj+FLH+qf/umf+O53v9ti+99d+Nq+fXuL1SlJUmsxfOlDrV27luHDhzNjxgwuvfRSRo4cycCBA/njH/8IwIMPPsjAgQMZPXo0r7zyyg7vqVdUVARAZWUlgwYNYujQoUyYMIGVK1eyYMECrr32WsaMGQPAUUcdxdVXX82oUaMYO3Yszz33HACvvvoqZ599dpaHLklS3rnUhPZIjx49+PGPf8y8efOYO3cuN910E9OmTWPFihV07dqVT3ziEx/4/ocffphZs2ZxzjnnUFNTQ6dOnRgxYgRlZWWUlJQA8PrrrzNlyhR69+7NokWLuOuuu7jtttu45557uOKKK7I4TEmSWkxernxFxIiIWBURqyNiyi5e3zciflr3+rMR0Scf9Sp7p512GgC9e/fmzTff5K9//SuHH344Bx54IPvssw+nnnoqANHoW2H1S5p87WtfY/78+Vx66aXcc889u6yjZ8+e9O7dG6i9MfGzzz7Le++9xy9+8QtGjx7dUocmSVImmn3lKyIKgNuBs4H1wPKImJ9SWplT7Apgc0qpKCIuAW4Cxja3brWAXa1rlCM3VKWU6N69Oxs3buTdd9+la9euPP/88wAccsghvPbaa6SU2LhxIxs2bACgW7du3HbbbaSUOO644xgzZgxdunShurq6Yb8FBQU71HfRRRdx9dVXc+aZZ7Lvvvu2xFFLkpSZfAw7DgJWp5TWAETE/cAoIDd8jQJm1D3+GXBbRERqqyu87q1mzIC3335/Re9d3duukYKCAmbOnElJSQlHH300PXv2BOCggw5ixIgRDB48mEGDBnH44YcDcMstt/CrX/2Kmpoazj77bA466CA++9nPMn36dE488UR+8IMf7FTHhAkT6NWrV8PcL0mS2rNmr3AfERcBI1JKZXXblwGnp5SuySnz+7oy6+u2X6kr89dG+7oSuBKgd+/ep7366qvNapv2QON72ZWX77zdSgtMbty4kXHjxvHUU0+1Sv2SJH2YPVnhvk1NuE8pzQHmQO3thVq5OXuX3HvXVVTU/oNWD15PPvkk3/jGN5g9e3ar1C9JUr7lY8L9BuDInO1edc/tskxEdAYOBt7MQ93Kp9wAVq8VgxfA2WefzbPPPsuwYcNarQ2SJOVTPsLXcqBvRBwdEV2AS4D5jcrMBy6ve3wR8JTzvdqgXc3xmjx5p0n3kiTpo2t2+EopVQPXAAuBPwAPpJRejIiZETGyrthdQLeIWA18FdhpOQq1ssZzvmpqan9WVBjAJEnKo7zM+UopPQE80ei56TmP/xcYk4+61EIioLBwxzle9UOQhYWtOvQoSVJH0uxvO7aU4uLiVFVV1drN2Pvsap0vg5ckSR9oT77t6L0dtaPGQcvgJUlSXhm+lKnt27e3dhMkSWpVbWqdL7VdU6dOZdmyZWzdupVp06ZRVVXFyy+/zN///nfWrVvH/fffzwknnMCSJUuYPn06EcEJJ5zAHXfcwauvvsqYMWM44YQT2GeffbjhhhsYN24c+++/P0cddRRbtmyhvLycc889l9/85jcA3HDDDfTp04fLLruslY9ckqT88sqXPtSCBQvYvHkzS5YsYdGiRUybNo2UEj169GD+/Pl8/etfZ+7cuaSU+MpXvsL8+fOprKxkv/324/HHHwdg7dq13H777dx9993cdNNNXH311SxYsKDhBtqHHHIIffv2paqqipQSjzzyCBdddFFrHrYkSS3CK1/a0S4m3L/wwgssWbKE0tJSALZs2cKbb77J6aefDkDv3r158skn+etf/8ratWsZNWoUAO+++y7HH388AwYMYMCAARx00EEAvPzyy0yaNAmA008/nZdffhmAK6+8krlz5/LOO+8wePBg9ttvv4wOWpKk7Bi+9L7d3Fi7/xtvcM4551BRd8uhrVu38u1vf5vICWkpJbp3784xxxzDY489xsc+9jEAtm3bxoYNGygoKGgoW1RURFVVFcceeyzLly9veP5Tn/oUX/va19i4cSMzZszI5JAlScqa4Uu1UqoNXvX3dMy5sfZ5kyax7IADKC0tJSLo1asXxx577E67iAhuueUWRo4cSUqJTp06UV5e3nDFq97111/PuHHjuPvuuzniiCPo0qVLw2tjx45l3rx5fOITn2jRw5UkqbW4zpfel7vKfb0WuLH29u3b6dSpExHBt771Lfbdd1/+9V//FYBbb72VAw44gIkTJ+atPkmSWtqerPNl+NKOUoJOOd/DqKnJ+1pfr732GmPHjiWlxIEHHsj999/PwQcfzPXXX8/y5ct5/PHHne8lSWpX9iR8Oeyo9+3uxtp5vvJ1xBFH8PTTT+/0/E033ZS3OiRJaqtcakK1vLG2JEmZ8MqXanljbUmSMuGcL+3IG2tLkrTHvLG2PjpvrC1JUosyfEmSJGXI8CVJkpQhw5ckSVKGDF+SJEkZMnxJkiRlyPAlSZKUIcOXJElShgxfkiRJGTJ8SZIkZcjwJUmSlCHDlyRJUoYMX5IkSRkyfEmSOoS1a9cyfPjwFtt/aWkp69evb7H9a+9h+JLasRtvvJEXXngBgKKiolZujdT+1NTU7LC9ffv2VmqJ9iadm/PmiDgU+CnQB1gLXJxS2ryLcguAM4ClKaXPNqdOSe+bMmVKazdBalPeeustxo4dyyuvvMJll13GySefzMyZM6murubQQw/lpz/9KV27dqWoqIiLL76YZ555httvv53LL7+cE044gX322Yfy8nImTpzIm2++SUqJOXPm7HBy8+KLL1JWVkbXrl3p2rUrv/zlL1vxiNUeNffK1xRgUUqpL7CobntXbgYua2Zd0l4tpcRVV11FSUkJQ4YM4be//S3jx49n6dKlrd00qc3485//zNy5c3nmmWe45557OOaYY1i8eDFPP/00J5xwAg888AAA1dXVXHDBBSxevJj999+ftWvXcvvtt3P33Xcze/ZsLrzwQhYtWkR5eflOJzkLFy5kwoQJLF68mMcff7w1DlPtXLOufAGjgNK6xz8CKoHrGxdKKS2KiNLGz0tqukcffZRt27axdOlS1qxZwyWXXEK/fv1au1lS9lKCiF1un3DCCRx44IEADBgwgL/85S9MnDiRLVu2sHHjRg466CAACgoKO
OOMMxp2MWDAgIbXXnjhBZYsWcKdd94JQOfOO/6pnDBhAt/61re49NJLOfnkk7n++p3+7EkfqLnh6/CU0ut1j/8CHN6cnUXElcCVAL17925m06SOZdWqVQwZMgSAY445hs2bdxrhlzq+GTPg7behvLw2cKUEkydDYSGMH88f//hH3n33Xbp27crvf/97ZsyYwTe/+U0GDx7M17/+dVJKAEQEkRPgCgoKGh7379+fwYMHM3r0aAC2bt26QxP23XdfvvOd7wAwfPhwzjvvPE466aQWPnB1JB867BgR/x0Rv9/Fv1G55VLtb3RqTmNSSnNSSsUppeIePXo0Z1dS+5d2/O90/HHHsWzZMgDWrFlDYWFha7RKaj0p1QaviorawFUfvCoqap9PiT59+jBx4kTOOOMMLr/8cr7whS9wxRVXMHr0aDZt2tSkaqZNm8YDDzzAsGHDGDp0KN/73vd2eP0nP/kJn/rUpzjzzDM59NBDOf7441viaNWBRUofPS9FxCqgNKX0ekR8HKhMKe3yt7Bu2PFfmzrhvri4OFVVVX3ktknt2i7O7mu+8hWuevpp/rD//mzfvp3y8nLuvPNOysrKKCkpoajNzadmAAAJ70lEQVSoiNWrV7d2y6WWlRu46k2a9P7/FelDrF27lrKyMv77v/97j943a9YsevXqxfjx43f5ekSsSCkVN2VfzR12nA9cDtxY9/PRZu5PUu7ZPdT+UZk8mU7f+x4/bPRHJnfOisFLe4WI2v8DueHL4KV2prnfdrwRODsiXgaG120TEcURMbe+UEQ8DTwIfDoi1kfEZ5pZr9Rx1f9xmTSp9g9Mp061P9v42f29997LO++8s0fvcW0y7bH6K1+56ocgpSaqX5KkuLiYiooK7r33Xi644AIuuOACTjnlFJ5++mkAfv3rX3PKKadwwQUX8Oyzz+at/maFr5TSmymlT6eU+qaUhqeU3qp7viqlVJZT7lMppR4ppf1SSr1SSgub23CpQ6sPYLnacPCC3YcvF61U3uQOOU6aBDU175+kGMC0BxovSbJp0ya2bdvGL37xC37+858zuS7gf/WrX+XRRx9l/vz5bNmyJW/1N3fYUVJL2N3ZfcYBbO3atVx44YX07du3YdHK8ePH77QA5bp163j++ecZM2YMxcXFXHfddYwZM6Zh0crZs2czfvx43nvvPQ444AB+9KMfkfulmm3btvGlL32JV155hW3btnHLLbcwaNAgSktLue++++jVq9cO8y2KiooYPXo0S5cu5bTTTuPjH/84Cxcu5JBDDuGRRx7Z4Vts6kAiar/VmHsVuP4kpbCwTZ+cqJXsalkSdl6SJKXEJz/5SQD69OnD3/72NwDeeeedhtUXBg0alLdmeXshqa1pY2f3jc8QJ0+evNMClMOGDWPgwIE8+OCDfP/73wfYadHKcePGsWTJEi655BJmz569Qx133XUXRUVFLF68mIceeqjhrHN3qqurueyyy3jmmWdYtGgRJ554Ir/+9a+JCJ5//vkW6wu1ATNm7HgSUh/AZsxozVapLZoxY8fPzPrP1ltvbViSpLq6mt///vdEBCtWrABg3bp1DWu+HXjggQ3381y+fHnemuaVL6mtaa2z+yaeIb7++utUVFTsdgHKermLVq5atYprrrkGgCFDhnD//ffvUPaFF15g2bJlLFiwAKDhrDP3ClbuN7M7d+7MySefDEDPnj055ZRTAOjVqxdvvfXWRzh4tSuN/w94xUuN7eaLS1RUwIQJDUuSvPzyy1x++eUccsgh7L///px//vm89tprlNd95n73u9/lggsu4Igjjmj4HMwHw5fUFs2YsWMYqg9gLfVHZncLV8JOi1aeeuqpXHnllTstQNmlSxeqq6sbdpm7aOXxxx/PsmXLKCoqYtmyZTuti9S/f3+KiooarnjV7/PQQw9l/fr19OrVixUrVnDkkUfusvm7C2mS9lK5J60VFe+HsEmT6FNezvJGn6X33nsvAwcO5Bvf+MYOz5eWlvLcc8/lvXkOO0ptVVZn9x+0cOU77+y0aGV5efkuF6C88MILueKKK/i3f/u3naqYMmUKP/7xjznzzDOZN28eU6dO3eH1iRMnsmrVKoYOHcrQoUOZNm0aAF/+8pcpKyvjwgsvZN99922Z45fUMbXhLy41a5HVluQiq1KGdrNw5dpJkyibOHGPFyOUpFaX8YK8e7LIqle+JLXpM0RJ2mNt7ItLjTnnS9Jul7boU17uVS9J7U8bX5bE8CXt7RqfIeZ+Kwi8Aiapfcr6i0t7wPAl7e3a+BmiJH1kbXRZEifcS6q1q3W+2sgHlSS1dU64l7Tn2ugZoiR1NIYvSZKkDBm+JEmSMmT4kiRJypDhS5IkKUOGL0mSpAwZviRJkjJk+JIkScqQ4UuSJClDhi9JkqQMtdnbC0XEG8D/AH9t7bZ0YN2xf1uKfdty7NuWY9+2HPu2ZbWF/j0qpdSjKQXbbPgCiIiqpt4nSXvO/m059m3LsW9bjn3bcuzbltXe+tdhR0mSpAwZviRJkjLU1sPXnNZuQAdn/7Yc+7bl2Lctx75tOfZty2pX/dum53xJkiR1NG39ypckSVKHYviSJEnKUJsIXxExIiJWRcTqiJiyi9e/GhErI+J3EbEoIo5qjXa2Rx/WtznlPhcRKSLazVd1W1tT+jYiLq773X0xIuZl3cb2rAmfC70jYnFEPFf32XBea7SzvYmIuyNiU0T8fjevR0R8r67ffxcRp2bdxvasCf17aV2/vhARyyLiE1m3sb36sL7NKffJiKiOiIuyatueavXwFREFwO3AuUA/YFxE9GtU7DmgOKV0MvAz4D+ybWX71MS+JSIOBCYBz2bbwvarKX0bEX2BqcA/p5T6A1/JvKHtVBN/d78BPJBSOgW4BPi/2bay3boXGPEBr58L9K37dyVwRwZt6kju5YP790/AWSmlk4AbaGcTxVvZvXxw39Z/dtwE/CqLBn1UrR6+gEHA6pTSmpTSVuB+YFRugZTS4pTSe3WbvwF6ZdzG9upD+7bODdT+sv5vlo1r55rStxOB21NKmwFSSpsybmN71pT+TcBBdY8PBl7LsH3tVkrp18BbH1BkFPCfqdZvgMKI+Hg2rWv/Pqx/U0rL6j8T8O/ZHmnC7y7AtcBDQJv+vG0L4asn8Oec7fV1z+3OFcAvW7RFHceH9m3dkMKRKaXHs2xYB9CU39vjgOMi4v+NiN9ExAeesWkHTenfGcDnI2I98AS1H7pqvj39TNZH59+zPIqInsBo2sHV2s6t3YA9ERGfB4qBs1q7LR1BRHQCbgHGt3JTOqrO1A7dlFJ7dvvriDgppfR2q7aq4xgH3JtS+m5EDAb+KyIGpJRqWrth0oeJiKHUhq+S1m5LB3IrcH1KqSYiWrstH6gthK8NwJE5273qnttBRAwHplE7Vr4lo7a1dx/WtwcCA4DKul/UfwLmR8TIlFJVZq1sn5rye7seeDaltA34U0S8RG0YW55NE9u1pvTvFdTN/0gpPRMRXam9uW6bHm5oB5r0mayPLiJOBuYC56aU3mzt9nQgxcD9dX/PugPnRUR1
SumR1m3WztrCsONyoG9EHB0RXaidODs/t0BEnAL8ABjpvJk98oF9m1L6W0qpe0qpT0qpD7XzDwxeTfOhv7fAI9Re9SIiulM7DLkmy0a2Y03p33XApwEi4kSgK/BGpq3smOYDX6j71uMZwN9SSq+3dqM6iojoDTwMXJZSeqm129ORpJSOzvl79jPg6rYYvKANXPlKKVVHxDXAQqAAuDul9GJEzASqUkrzgZuBjwEP1iXadSmlka3W6HaiiX2rj6CJfbsQOCciVgLbga95lts0Tezf64AfRsRkaiffj0/esuNDRcRPqD0p6F43X+7fgX0AUkp3Ujt/7jxgNfAeMKF1Wto+NaF/pwPdgP9b9/esOqXkEj9N0IS+bTe8vZAkSVKG2sKwoyRJ0l7D8CVJkpQhw5ckSVKGDF+SJEkZMnxJkiRlyPAlSZKUIcOXJElShv5/lHM7iof9VBAAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']\n", + "plot_embeddings(M_reduced, word2Ind, words)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer\n", + "\n", + "\n", + "\n", + "Keywords like industry,energy and oil,petroleum custers are different than what we had for the co-occurence based scheme, which tells us that Word2Vec captures different meanings/relationships than the prior scheme since Word2Vec is based on a predictive objective of what appears next.
\n", + "However, the country names do not form a close cluster as before.
\n", + "bpd and barrels are spread across with a relative horizonal distance of 0.3 here and <0.1 in the prior scheme. They are similar words but not close since one or the other would be used to signify the production output.
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Cosine Similarity\n", + "Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are \"close\" and \"far\" from one another.\n", + "\n", + "We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective L1 and L2 Distances help quantify the amount of space \"we must travel\" to get between these two points. Another approach is to examine the angle between two vectors. From trigonometry we know that:\n", + "\n", + "\n", + "\n", + "Instead of computing the actual angle, we can leave the similarity in terms of $similarity = cos(\\Theta)$. Formally the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) $s$ between two vectors $p$ and $q$ is defined as:\n", + "\n", + "$$s = \\frac{p \\cdot q}{||p|| ||q||}, \\textrm{ where } s \\in [-1, 1] $$ " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.2: Polysemous Words (2 points) [code + written] \n", + "Find a [polysemous](https://en.wikipedia.org/wiki/Polysemy) word (for example, \"leaves\" or \"scoop\") such that the top-10 most similar words (according to cosine similarity) contains related words from *both* meanings. For example, \"leaves\" has both \"vanishes\" and \"stalks\" in the top 10, and \"scoop\" has both \"handed_waffle_cone\" and \"lowdown\". You will probably need to try several polysemous words before you find one. Please state the polysemous word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous words you tried didn't work?\n", + "\n", + "找到一个多义词(例如,“叶子”或者“勺子”) ,这样前10个最相似的词(根据余弦距离)就包含了两个意思相关的词。 例如,“叶子”在前10名中既有“消失”又有“茎” ,“勺子”既有“手握华夫饼圆锥体”又有“下端”。 在找到一个单词之前,你可能需要尝试几个多义词。 请说出你发现的多义词,以及在前10中出现的多义词。 为什么你认为你试过的许多多义词都不起作用?\n", + "\n", + "**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__." + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[('lies', 0.6923316717147827),\n", + " ('lying', 0.6025736331939697),\n", + " ('Lying', 0.5922049283981323),\n", + " ('Terravista_complex', 0.5337908267974854),\n", + " ('lay', 0.5245625376701355),\n", + " ('Lie', 0.49543923139572144),\n", + " ('perjure_yourself', 0.46970558166503906),\n", + " ('sit', 0.46696919202804565),\n", + " ('BE_TRUTHFUL_Do', 0.46605420112609863),\n", + " ('lurk', 0.45586875081062317)]" + ] + }, + "execution_count": 91, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# ------------------\n", + "# Write your polysemous word exploration code here.\n", + "\n", + "wv_from_bin.most_similar(\"tail\")\n", + "#wv_from_bin.most_similar(\"lie\")\n", + "# ------------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer\n", + "\n", + "\n", + "** polysemous word - tail\n", + "1. hind - towards the end\n", + "2. tails - animal tails\n", + "\n", + "** polysemous word - lie\n", + "1. lies - to lie\n", + "2. 
lay - lay down\n", + "\n", + "I tried right, fan, star, one, and many others, which didn't work. The reason is that the data the algorithm is trained on is biased towards a word's most common uses rather than an even distribution over its different meanings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.3: Synonyms & Antonyms (2 points) [code + written] \n", + "\n", + "When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.\n", + "\n", + "Find three words (w1,w2,w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, but Cosine Distance(w1,w3) < Cosine Distance(w1,w2). For example, w1=\"happy\" is closer to w3=\"sad\" than to w2=\"cheerful\". \n", + "\n", + "Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.\n", + "\n", + "You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance." + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Synonyms bad, evil have cosine distance: 0.6706309994333031\n", + "Antonyms bad, good have cosine distance: 0.28099487917237653\n" + ] + } + ], + "source": [ + "# ------------------\n", + "# Write your synonym & antonym exploration code here.\n", + "\n", + "w1 = \"bad\"\n", + "w2 = \"evil\"\n", + "w3 = \"good\"\n", + "w1_w2_dist = wv_from_bin.distance(w1, w2)\n", + "w1_w3_dist = wv_from_bin.distance(w1, w3)\n", + "\n", + "print(\"Synonyms {}, {} have cosine distance: {}\".format(w1, w2, w1_w2_dist))\n", + "print(\"Antonyms {}, {} have cosine distance: {}\".format(w1, w3, w1_w3_dist))\n", + "\n", + "# ------------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Write your answer here.\n", + "\n", + "CosineDist(bad, evil) > CosineDist(bad, good)\n", + "\n", + "The word2vec algorithm uses localized windows to group words together and find context. Synonyms are typically not used together in the same sentence as often as antonyms are.\n", + "For our example: if the text corpus comes from news articles, it is very common to critically compare contrasting ideas using antonyms, and much rarer to describe the same idea twice with synonyms in one context." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Solving Analogies with Word Vectors\n", + "Word2Vec vectors have been shown to *sometimes* exhibit the ability to solve analogies. \n", + "\n", + "As an example, for the analogy \"man : king :: woman : x\", what is x?\n", + "\n", + "In the cell below, we show you how to use word vectors to find x. The `most_similar` function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list.
The answer to the analogy will be the word ranked most similar (largest numerical value).\n", + "\n", + "**Note:** Further Documentation on the `most_similar` function can be found within the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__." + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[('queen', 0.7118192911148071),\n", + " ('monarch', 0.6189674139022827),\n", + " ('princess', 0.5902431607246399),\n", + " ('crown_prince', 0.5499460697174072),\n", + " ('prince', 0.5377321243286133),\n", + " ('kings', 0.5236844420433044),\n", + " ('Queen_Consort', 0.5235945582389832),\n", + " ('queens', 0.5181134343147278),\n", + " ('sultan', 0.5098593235015869),\n", + " ('monarchy', 0.5087411999702454)]\n" + ] + } + ], + "source": [ + "# Run this cell to answer the analogy -- man : king :: woman : x\n", + "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.4: Finding Analogies [code + written] (2 Points)\n", + "Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.\n", + "\n", + "**Note**: You may have to try many analogies to find one that works!" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[('actress', 0.860262393951416),\n", + " ('actresses', 0.6596669554710388),\n", + " ('thesp', 0.629091739654541),\n", + " ('Actress', 0.6165294051170349),\n", + " ('actress_Rachel_Weisz', 0.5997322797775269),\n", + " ('Best_Actress', 0.5896061658859253),\n", + " ('actors', 0.5714285373687744),\n", + " ('LIEV_SCHREIBER', 0.5616893768310547),\n", + " ('Oscarwinning', 0.5589709281921387),\n", + " ('Susan_Penhaligon', 0.5582746267318726)]\n" + ] + } + ], + "source": [ + "# ------------------\n", + "# Write your analogy exploration code here.\n", + "\n", + "pprint.pprint(wv_from_bin.most_similar(positive=['woman','actor'], negative=['man']))\n", + "\n", + "# ------------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Write your answer here.\n", + "\n", + "The analogy that works is -\n", + "man:actor :: woman:actress" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.5: Incorrect Analogy [code + written] (1 point)\n", + "Find an example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors." 
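+ "\n", + "For intuition, the ranking that `most_similar` performs in the analogy cells above can be sketched in plain NumPy. This is a simplified 3CosAdd-style sketch over the matrix `M` and `word2Ind` map from earlier, assuming the rows of `M` are unit-normalized and that `ind2Word` (the inverse index map) is available; it is not gensim's exact implementation:\n", + "\n", + "```python\n", + "import numpy as np\n", + "\n", + "def analogy(M, word2Ind, ind2Word, a, b, c):\n", + "    # Solve a : b :: c : ? by ranking cosine similarity to (b - a + c).\n", + "    # Assumes the rows of M are unit-length word vectors.\n", + "    query = M[word2Ind[b]] - M[word2Ind[a]] + M[word2Ind[c]]\n", + "    query /= np.linalg.norm(query)\n", + "    sims = M @ query  # dot product == cosine similarity for unit vectors\n", + "    for w in (a, b, c):\n", + "        sims[word2Ind[w]] = -np.inf  # exclude the query words themselves\n", + "    return ind2Word[int(np.argmax(sims))]\n", + "```"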
+ ] + }, + { + "cell_type": "code", + "execution_count": 194, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[('pink_flamingo', 0.43736279010772705),\n", + " ('Downy_Woodpecker', 0.4316130578517914),\n", + " ('rooster', 0.43012362718582153),\n", + " ('blue_grosbeak', 0.4159660339355469),\n", + " ('raven', 0.40570688247680664),\n", + " ('conure', 0.40150076150894165),\n", + " ('colorful_plumage', 0.39959049224853516),\n", + " ('Baltimore_oriole', 0.39683181047439575),\n", + " ('plastic_flamingo', 0.39603763818740845),\n", + " ('heron', 0.39494696259498596)]\n" + ] + } + ], + "source": [ + "# ------------------\n", + "# Write your incorrect analogy exploration code here.\n", + "\n", + "pprint.pprint(wv_from_bin.most_similar(positive=['America','peacock'], negative=['India']))\n", + "\n", + "\n", + "# ------------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer\n", + "\n", + "India:peacock::America:Eagle\n", + "\n", + "The national bird of India is peacock, the vectors should have predicted a linear relation to the national bird of America which is the Eagle.\n", + "\n", + "印度的国鸟是孔雀,向量应该已经预测到,美国的国鸟是鹰,的线性关系。\n", + "\n", + "Incorrect b is pink_flamingo" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.6: Guided Analysis of Bias in Word Vectors [written] (1 point)\n", + "\n", + "It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit to our word embeddings.\n", + "\n", + "Run the cell below, to examine (a) which terms are most similar to \"woman\" and \"boss\" and most dissimilar to \"man\", and (b) which terms are most similar to \"man\" and \"boss\" and most dissimilar to \"woman\". What do you find in the top 10?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 188, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[('bosses', 0.5522644519805908),\n", + " ('manageress', 0.49151360988616943),\n", + " ('exec', 0.45940813422203064),\n", + " ('Manageress', 0.45598435401916504),\n", + " ('receptionist', 0.4474116563796997),\n", + " ('Jane_Danson', 0.44480544328689575),\n", + " ('Fiz_Jennie_McAlpine', 0.44275766611099243),\n", + " ('Coronation_Street_actress', 0.44275566935539246),\n", + " ('supremo', 0.4409853219985962),\n", + " ('coworker', 0.43986251950263977)]\n", + "[('supremo', 0.6097398400306702),\n", + " ('MOTHERWELL_boss', 0.5489562153816223),\n", + " ('CARETAKER_boss', 0.5375303626060486),\n", + " ('Bully_Wee_boss', 0.5333974361419678),\n", + " ('YEOVIL_Town_boss', 0.5321705341339111),\n", + " ('head_honcho', 0.5281980037689209),\n", + " ('manager_Stan_Ternent', 0.525971531867981),\n", + " ('Viv_Busby', 0.5256162881851196),\n", + " ('striker_Gabby_Agbonlahor', 0.5250812768936157),\n", + " ('BARNSLEY_boss', 0.5238943099975586)]\n" + ] + } + ], + "source": [ + "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))\n", + "pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Write your answer here.\n", + "\n", + "There seems to be some bias in the vectors generated around meanings of boss with respect to gender.\n", + "\n", + "On one end, vectors around woman and boss (dissimilar to man) shows jobs roles like receptionist, name of actresses and some gender neutral words like supremo,co-worker,bosses\n", + "\n", + "On the other hand, looking at vectors around man and boss (dissimilar to woman) has most words related to football team managers and male football players with the one common gender neutral word (supremo).\n", + "\n", + "The vectors seem to have learnt some discrepency in employment roles among genders.\n", + "\n", + "似乎有一些偏见的矢量产生的意义上的老板与性别有关。\n", + "\n", + "\n", + "一方面,围绕女性和老板(与男性不同)展示了诸如接待员、女演员名字等工作角色,以及一些中性词汇,如至高无上、同事、老板等\n", + "\n", + "\n", + "另一方面,研究围绕着男人和老板(不同于女人)的矢量,大部分词汇都与足球队经理和男足球员有关,只有一个共同的中性词汇(至高无上)。\n", + "\n", + "\n", + "两性之间在就业角色上似乎存在着差异。\n", + "\n", + "答案为个人意见,仅供参考" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.7: Independent Analysis of Bias in Word Vectors [code + written] (2 points)\n", + "\n", + "Use the `most_similar` function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 189, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[('physicist', 0.6613448858261108),\n", + " ('Physicist', 0.5998005270957947),\n", + " ('theoretical_physicist', 0.5916016697883606),\n", + " ('atmospheric_chemist', 0.5844619274139404),\n", + " ('researcher', 0.5844188332557678),\n", + " ('biologist', 0.5710763335227966),\n", + " ('atmospheric_physicist', 0.569758415222168),\n", + " ('geneticist', 0.5594449043273926),\n", + " ('biochemist', 0.5531430840492249),\n", + " ('mathematician', 0.5528060793876648)]\n", + "[('researcher', 0.7213959097862244),\n", + " ('biologist', 0.5944803953170776),\n", + " ('geneticist', 0.5939854383468628),\n", + " ('microbiologist', 0.5772261619567871),\n", + " ('professor', 0.5715740919113159),\n", + " ('biochemist', 0.5685313940048218),\n", + " ('physicist', 0.5617247819900513),\n", + " ('Researcher', 0.5584947466850281),\n", + " ('anthropologist', 0.5538322925567627),\n", + " ('molecular_biologist', 0.5461256504058838)]\n" + ] + } + ], + "source": [ + "# ------------------\n", + "# Write your bias exploration code here.\n", + "pprint.pprint(wv_from_bin.most_similar(positive=['man','scientist'], negative=['woman']))\n", + "pprint.pprint(wv_from_bin.most_similar(positive=['woman','scientist'], negative=['man']))\n", + "\n", + "# ------------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Write your answer here.\n", + "\n", + "In the above example, the scientist roles that are similar to man but dissimilar to woman seems to cover a range of careers that are exclusive to men for example - physicist, theoritical physicist, atmospheric_physicist,mathematician etc.
\n", + "On the other hand, the scientist roles learnt by the embeddings that are similar to woman and dissimilar to man are more like biologist, geneticist, micobiologist,biochemist this exhibits a clear bias in gender specific career choices. \n", + "\n", + "Careers like physicist have more similarity to man than woman.\n", + "\n", + "男性更接近物理学家、理论物理学家、大气物理学家、数学家等\n", + "\n", + "女性更接近生物学家、遗传学家、微生物学家、生物化学家等\n", + "\n", + "答案为个人意见,仅供参考" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2.8: Thinking About Bias [written] (1 point)\n", + "\n", + "What might be the cause of these biases in the word vectors?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Write your answer here.\n", + "\n", + "Bias in training data can be captured by word embeddings and translated into vector space. Often news articles exhibit bias related to race, religion, gender, sexual orientation etc. The training objective is to maximize the probability of prediciting the next word correctly which means if the context windows have bias terms they will likely be captured by the scheme. For example - In the sentence, \"grandmother was a nurse\", the algorithm can learn to predict nurse as one of the next words everytime it sees grandmother rather than any other occupation unless exposed to more data which helps it generalize better.\n", + "\n", + "训练数据中的偏差可以通过词嵌入来捕获并转化为向量空间。 通常新闻报道会表现出与种族、宗教、性别、性取向等相关的偏见。 训练的目的是最大限度地提高正确预测下一个单词的概率,这意味着如果上下文窗口有偏差项,它们可能会被方案捕获。 例如——在“祖母是护士”这个句子中,算法可以学习预测护士是每次见到祖母时的下一个词,而不是其他任何职业,除非接触到更多的数据让它更好地归纳。\n", + "\n", + "答案为个人意见,仅供参考" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Submission Instructions\n", + "\n", + "1. Click the Save button at the top of the Jupyter Notebook.\n", + "2. Please make a Gradescope account using your @stanford.edu email address (this is very important to help us enter your grade at the end of the quarter), and ensure your SUID is entered at the top of this notebook too.\n", + "3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells). \n", + "4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.\n", + "5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX\n", + "6. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing your graders will see!\n", + "7. Submit your PDF on Gradescope." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/Assignment_1_intro_word_vectors/imgs/inner_product.png b/Assignment_1_intro_word_vectors/imgs/inner_product.png new file mode 100644 index 0000000..09ad7df Binary files /dev/null and b/Assignment_1_intro_word_vectors/imgs/inner_product.png differ diff --git a/Assignment_1_intro_word_vectors/imgs/svd.png b/Assignment_1_intro_word_vectors/imgs/svd.png new file mode 100644 index 0000000..a33b316 Binary files /dev/null and b/Assignment_1_intro_word_vectors/imgs/svd.png differ diff --git a/Assignment_1_intro_word_vectors/imgs/test_plot.png b/Assignment_1_intro_word_vectors/imgs/test_plot.png new file mode 100644 index 0000000..c2f5f00 Binary files /dev/null and b/Assignment_1_intro_word_vectors/imgs/test_plot.png differ diff --git a/Assignment_1_intro_word_vectors/imgs/word2vec-king-queen-composition.png b/Assignment_1_intro_word_vectors/imgs/word2vec-king-queen-composition.png new file mode 100644 index 0000000..a44fea5 Binary files /dev/null and b/Assignment_1_intro_word_vectors/imgs/word2vec-king-queen-composition.png differ diff --git a/Assignment_1_intro_word_vectors/output_21_1.png b/Assignment_1_intro_word_vectors/output_21_1.png new file mode 100644 index 0000000..73cc0d2 Binary files /dev/null and b/Assignment_1_intro_word_vectors/output_21_1.png differ diff --git a/Assignment_1_intro_word_vectors/output_24_1.png b/Assignment_1_intro_word_vectors/output_24_1.png new file mode 100644 index 0000000..0791f45 Binary files /dev/null and b/Assignment_1_intro_word_vectors/output_24_1.png differ diff --git a/Assignment_1_intro_word_vectors/output_34_0.png b/Assignment_1_intro_word_vectors/output_34_0.png new file mode 100644 index 0000000..8e5502f Binary files /dev/null and b/Assignment_1_intro_word_vectors/output_34_0.png differ diff --git a/Assignment_1_intro_word_vectors/python review.ipynb b/Assignment_1_intro_word_vectors/python review.ipynb new file mode 100644 index 0000000..ecad104 --- /dev/null +++ b/Assignment_1_intro_word_vectors/python review.ipynb @@ -0,0 +1,963 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python Numpy Review\n", + "\n", + "主要复习numpy\n", + "\n", + "tutor: `chongjiujin # gmail.com`\n", + "\n", + "```\n", + "if you have any question in python or pytorch:\n", + "\n", + " print(add personal weichat:flypython)\n", + " ```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# List Slicing\n", + "\n", + "List elements can be accessed in convenient ways.\n", + "\n", + "Basic format: some_list[start_index:end_index]" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0, 1, 2, 3, 4, 5, 6]" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "numbers = [0, 1, 2, 3, 4, 5, 6]\n", + "numbers" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": 
[ + "[0, 1, 2]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "numbers[0:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0, 1, 2, 3]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "numbers[:4]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[5, 6]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "numbers[5:]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0, 1, 2, 3, 4, 5, 6]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "numbers[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Negative index wraps around\n", + "numbers[-1]" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[4, 5, 6]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "numbers[-3:]" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Can mix and match\n", + "numbers[1:-10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Numpy python矩阵计算库\n", + "\n", + "\n", + "Optimized library for matrix and vector computation.\n", + "\n", + "用于矩阵和向量\n", + "\n", + "\n", + "\n", + "Makes use of C/C++ subroutines and memory-efficient data structures.\n", + "\n", + "底层是C/C++编译的,效率更高\n", + "\n", + "(Lots of computation can be efficiently represented as vectors.)\n", + "\n", + "**Main data type: `np.ndarray`**\n", + "\n", + "This is the data type that you will use to represent matrix/vector computations.\n", + "这个数据结构是用来放矩阵/向量的\n", + "\n", + "Note: constructor function is `np.array()`\n", + "\n", + " `np.array()`初始化函数\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np#导入库" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3,)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = np.array([1,2,3])#一维向量\n", + "x\n", + "x.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 3)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y = np.array([[3,4,5],[6,7,8]])#二维矩阵\n", + "y.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 1)" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y = np.array([[1],[2],[3]])#每个框是增加一个维度\n", + 
"y.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# np.ndarray Operations 操作函数\n", + "\n", + "Reductions: `np.max`, `np.min`, `np.argmax`, `np.sum`, `np.mean`, …\n", + "\n", + "Always reduces along an axis! (Or will reduce along all axes if not specified.)\n", + "\n", + "(You can think of this as “collapsing” this axis into the function’s output.)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = np.array([1,2,3])#一维向量\n", + "x.max()#np.max(x)\n", + "#x.min()\n", + "#x.sum()\n", + "#x.mean()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[5],\n", + " [8]])" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y = np.array([[3,4,5],[6,7,8]])#按维度取最大值\n", + "#np.max(y,axis = 1)\n", + "np.max(y, axis = 1, keepdims = True)\n", + "#https://docs.scipy.org/doc/numpy/reference/generated/numpy.amax.html#numpy.amax" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 基本矩阵运算\n", + "\n", + "\n", + "`np.dot`矩阵点乘\n", + "$$ np.dot(v,w)=v^T w $$\n", + "https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html?highlight=dot#numpy.dot\n", + "\n", + "`np.multiply` 在 np.array 中重载为元素乘法,在 np.matrix 中重载为矩阵乘法\n", + "\n", + "https://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html\n", + "\n", + "\n", + "我们这里只讨论一维向量" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "14" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#np.dot点乘\n", + "\n", + "x=np.array([1,2,3])#一维向量\n", + "y=np.array([1,2,3])#一维向量\n", + "np.dot(x,y)\n", + "#" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "14" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sum(x.T*y)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 4, 9])" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x=np.array([1,2,3])#一维向量\n", + "np.multiply(x,x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Indexing 索引" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([3])" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#基本同list\n", + "x = np.array([1,2,3])#一维向量\n", + "x[x > 2]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([3, 2, 1])" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "index=[2,1,0]#按索引排序\n", + "x[index]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 矩阵遍历\n", + "\n", + "有时候需要遍历矩阵里所有的向量" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": 
"stdout", + "output_type": "stream", + "text": [ + "[[3 4 5]\n", + " [6 7 8]]\n" + ] + } + ], + "source": [ + "y = np.array([[3,4,5],[6,7,8]])#二维矩阵\n", + "print(y)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[3 4 5]\n", + "-----\n", + "[6 7 8]\n", + "-----\n" + ] + } + ], + "source": [ + "#默认按第1维度遍历\n", + "for y1 in y:\n", + " print(y1)\n", + " print(\"-----\")" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2 3\n" + ] + } + ], + "source": [ + "#按指定维度遍历\n", + "d1,d2= y.shape\n", + "print(d1,d2)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 [3 6]\n", + "1 [4 7]\n", + "2 [5 8]\n" + ] + } + ], + "source": [ + "for d in range(d2):\n", + " print(d,y[:,d])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Efficient Numpy Code\n", + "尽量用Numpy的特性提升效率" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "x = np.array([[3,4,5],[6,7,8]])#二维矩阵\n", + "y = np.array([[1,2,3],[9,0,10]])#二维矩阵" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 9, 16, 25],\n", + " [36, 49, 64]])" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "for i in range(x.shape[0]):\n", + " for j in range(x.shape[1]):\n", + " x[i,j] **= 2\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 81, 256, 625],\n", + " [1296, 2401, 4096]])" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x **= 2\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 全0 和全 1 矩阵" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1.30950800e+06, 1.82888704e+08])" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "z=np.zeros((2,))\n", + "for i in range(x.shape[0]):\n", + " x1=x[i]\n", + " y1=y[i]\n", + " z[i]=np.dot(x1,y1)\n", + "z" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1., 1., 1.],\n", + " [1., 1., 1.]])" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "z=np.ones((2,3))\n", + "z" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 矩阵和常数计算以及 Broadcasting广播" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 3)" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = np.array([[3,4,5],[6,7,8],[1,2,3]])#二维矩阵\n", + "x.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 5, 6, 7],\n", + " [ 8, 9, 10],\n", + " [ 3, 4, 5]])" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": 
"execute_result" + } + ], + "source": [ + "x+2" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 6, 8, 10],\n", + " [12, 14, 16],\n", + " [ 2, 4, 6]])" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x*2" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 1)" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y=np.array([[2],[4],[8]])\n", + "y.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 5, 6, 7],\n", + " [10, 11, 12],\n", + " [ 9, 10, 11]])" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x+y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 矩阵变换" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1, 3)" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "z=np.array([[2, 4, 8]])\n", + "z.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 1)" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "z=y.reshape(-1,1)\n", + "z.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1, 3)" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "z=y.T\n", + "z.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 6, 16, 40],\n", + " [12, 28, 64],\n", + " [ 2, 8, 24]])" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x*z" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 思考题\n", + "y=np.array([[2],[4],[8]])\n", + "\n", + "(y + y.T)是什么\n", + "\n", + "\n", + "# 如果对操作有不确定,开一个jupyter notebook,测试后使用" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/README.md b/README.md new file mode 100644 index 0000000..daf9535 --- /dev/null +++ b/README.md @@ -0,0 +1,32 @@ +### Stanford / Winter 2019 + +# CS224N-Stanford-Winter-2019 +The collection of ALL relevant materials about CS224N-Stanford/Winter 2019 course. THANKS TO THE PROFESSOR AND TAs! +All the rights of the relevant materials belong to Standfor University. +斯坦福大学CS224N 【2019】课程的【所有】相关的资料。感谢Chris Manning教授和Abigail See,感谢所有助教! 
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..daf9535
--- /dev/null
+++ b/README.md
@@ -0,0 +1,32 @@
+### Stanford / Winter 2019
+
+# CS224N-Stanford-Winter-2019
+A collection of ALL relevant materials for the Stanford CS224N (Winter 2019) course. THANKS TO THE PROFESSOR AND TAs!
+All rights to the relevant materials belong to Stanford University.
+Thanks to Professor Chris Manning, Abigail See, and all the teaching assistants!
+
+-----------
+
+
+
+Assignment 1 : Introduction to word vectors
+
+Assignment 2 : Derivatives and implementation of the word2vec algorithm
+
+Assignment 3 : Dependency parsing and neural network foundations
+
+Assignment 4 : Neural Machine Translation with sequence-to-sequence and attention
+
+Assignment 5 : Neural Machine Translation with ConvNets and subword modeling
+
+
+Lectures
+
+
+ 1. cs224n-2019-lecture01-wordvecs1
+ 2. cs224n-2019-lecture02-wordvecs2
+ 3. cs224n-2019-lecture03-neuralnets
+ 4. cs224n-2019-lecture04-backprop
+ 5. cs224n-2019-lecture05-dep-parsing
+ 6. cs224n-2019-lecture06-rnnlm
+...
\ No newline at end of file