{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's look at another example of using distances and similarities. This time, we'll look at a hypothetical search engine index and see how we can use k-nearest-neighbor search to identify the documents that are most similar to a specified query. A similar approach can be used to find objects that are most similar to a specified object in other domains."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### In our data set, we have 15 documents. We assume that the documents have already been preprocessed, converted into word vectors (bags of words), and inserted into an index. After preprocessing and removing \"stop words\" we are left with 10 index terms (used as dimensions for the document vectors). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n",
"0 database 24 32 12 6 43 2 0 3 1 6 4 0 0 0 0\n",
"1 index 9 5 5 2 20 0 1 0 0 0 27 14 3 2 11\n",
"2 likelihood 0 3 0 0 3 7 12 4 27 4 0 1 0 0 0\n",
"3 linear 3 0 0 0 0 16 0 2 25 23 7 12 21 3 2\n",
"4 matrix 1 0 0 0 0 33 2 0 7 12 14 5 12 4 0\n",
"5 query 12 2 0 0 27 0 0 0 0 22 9 4 0 5 3\n",
"6 regression 0 0 0 0 0 18 32 22 34 17 0 0 0 0 0\n",
"7 retrieval 1 0 0 0 2 0 0 0 3 9 27 7 5 4 4\n",
"8 sql 21 10 16 7 31 0 0 0 0 0 0 0 0 1 0\n",
"9 vector 2 0 0 2 0 27 4 2 11 8 33 16 14 7 3"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DF = pd.read_csv(\"http://facweb.cs.depaul.edu/mobasher/classes/csc478/data/term-doc-mat.csv\", header=None)\n",
"DF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's remove the column containing the terms"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n",
"0 24 32 12 6 43 2 0 3 1 6 4 0 0 0 0\n",
"1 9 5 5 2 20 0 1 0 0 0 27 14 3 2 11\n",
"2 0 3 0 0 3 7 12 4 27 4 0 1 0 0 0\n",
"3 3 0 0 0 0 16 0 2 25 23 7 12 21 3 2\n",
"4 1 0 0 0 0 33 2 0 7 12 14 5 12 4 0\n",
"5 12 2 0 0 27 0 0 0 0 22 9 4 0 5 3\n",
"6 0 0 0 0 0 18 32 22 34 17 0 0 0 0 0\n",
"7 1 0 0 0 2 0 0 0 3 9 27 7 5 4 4\n",
"8 21 10 16 7 31 0 0 0 0 0 0 0 0 1 0\n",
"9 2 0 0 2 0 27 4 2 11 8 33 16 14 7 3"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TD will be the term-by-document matrix (rows are terms, columns are documents)\n",
"TD = DF.iloc[:,1:]\n",
"TD"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14\n",
"0 24 32 12 6 43 2 0 3 1 6 4 0 0 0 0\n",
"1 9 5 5 2 20 0 1 0 0 0 27 14 3 2 11\n",
"2 0 3 0 0 3 7 12 4 27 4 0 1 0 0 0\n",
"3 3 0 0 0 0 16 0 2 25 23 7 12 21 3 2\n",
"4 1 0 0 0 0 33 2 0 7 12 14 5 12 4 0\n",
"5 12 2 0 0 27 0 0 0 0 22 9 4 0 5 3\n",
"6 0 0 0 0 0 18 32 22 34 17 0 0 0 0 0\n",
"7 1 0 0 0 2 0 0 0 3 9 27 7 5 4 4\n",
"8 21 10 16 7 31 0 0 0 0 0 0 0 0 1 0\n",
"9 2 0 0 2 0 27 4 2 11 8 33 16 14 7 3"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reindex the columns to start from 0\n",
"TD.columns= range(15)\n",
"TD"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 database\n",
"1 index\n",
"2 likelihood\n",
"3 linear\n",
"4 matrix\n",
"5 query\n",
"6 regression\n",
"7 retrieval\n",
"8 sql\n",
"9 vector\n",
"Name: 0, dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The list of our index terms\n",
"terms = DF.iloc[:,0]\n",
"terms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Transposing the TD matrix."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"DT = TD.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now we have a document-term matrix:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9\n",
"0 24 9 0 3 1 12 0 1 21 2\n",
"1 32 5 3 0 0 2 0 0 10 0\n",
"2 12 5 0 0 0 0 0 0 16 0\n",
"3 6 2 0 0 0 0 0 0 7 2\n",
"4 43 20 3 0 0 27 0 2 31 0\n",
"5 2 0 7 16 33 0 18 0 0 27\n",
"6 0 1 12 0 2 0 32 0 0 4\n",
"7 3 0 4 2 0 0 22 0 0 2\n",
"8 1 0 27 25 7 0 34 3 0 11\n",
"9 6 0 4 23 12 22 17 9 0 8\n",
"10 4 27 0 7 14 9 0 27 0 33\n",
"11 0 14 1 12 5 4 0 7 0 16\n",
"12 0 3 0 21 12 0 0 5 0 14\n",
"13 0 2 0 3 4 5 0 4 1 7\n",
"14 0 11 0 2 0 3 0 4 0 3"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(15, 10)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DT.shape"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"numTerms=DT.shape[1]\n",
"NDocs = DT.shape[0]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10\n",
"15\n"
]
}
],
"source": [
"print(numTerms)\n",
"print(NDocs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Next, let's compute term frequencies to get an idea of their distributions across the corpus."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 133\n",
"1 99\n",
"2 61\n",
"3 114\n",
"4 90\n",
"5 84\n",
"6 123\n",
"7 62\n",
"8 86\n",
"9 129\n",
"dtype: int64\n"
]
}
],
"source": [
"termFreqs = TD.sum(axis=1)\n",
"print(termFreqs)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Figure: line plot of the ten term frequencies sorted in descending order]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(sorted(termFreqs, reverse=True))\n",
"plt.show()"
]
},
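{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The totals above are indexed by row number, so reading them means looking up each term separately. As a small sketch (assuming pandas, and using a hypothetical three-term excerpt named `sample` that is copied from the table above rather than read from the full file), the term names can serve as the index so that each frequency is labeled directly.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical excerpt: the first three rows of the term-document table above\n",
"sample = pd.DataFrame(\n",
"    [[24, 32, 12, 6, 43, 2, 0, 3, 1, 6, 4, 0, 0, 0, 0],\n",
"     [9, 5, 5, 2, 20, 0, 1, 0, 0, 0, 27, 14, 3, 2, 11],\n",
"     [0, 3, 0, 0, 3, 7, 12, 4, 27, 4, 0, 1, 0, 0, 0]],\n",
"    index=[\"database\", \"index\", \"likelihood\"],\n",
")\n",
"\n",
"# Sum across documents (columns) to get the total count of each term\n",
"sample_freqs = sample.sum(axis=1).sort_values(ascending=False)\n",
"print(sample_freqs)\n",
"```\n",
"\n",
"#### The three totals should agree with the first three values printed by the full computation above."
]
},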
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We convert the DataFrame into a NumPy array, which will be used as input for our search function."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[24, 9, 0, 3, 1, 12, 0, 1, 21, 2],\n",
" [32, 5, 3, 0, 0, 2, 0, 0, 10, 0],\n",
" [12, 5, 0, 0, 0, 0, 0, 0, 16, 0],\n",
" [ 6, 2, 0, 0, 0, 0, 0, 0, 7, 2],\n",
" [43, 20, 3, 0, 0, 27, 0, 2, 31, 0],\n",
" [ 2, 0, 7, 16, 33, 0, 18, 0, 0, 27],\n",
" [ 0, 1, 12, 0, 2, 0, 32, 0, 0, 4],\n",
" [ 3, 0, 4, 2, 0, 0, 22, 0, 0, 2],\n",
" [ 1, 0, 27, 25, 7, 0, 34, 3, 0, 11],\n",
" [ 6, 0, 4, 23, 12, 22, 17, 9, 0, 8],\n",
" [ 4, 27, 0, 7, 14, 9, 0, 27, 0, 33],\n",
" [ 0, 14, 1, 12, 5, 4, 0, 7, 0, 16],\n",
" [ 0, 3, 0, 21, 12, 0, 0, 5, 0, 14],\n",
" [ 0, 2, 0, 3, 4, 5, 0, 4, 1, 7],\n",
" [ 0, 11, 0, 2, 0, 3, 0, 4, 0, 3]], dtype=int64)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DTM = np.array(DT)\n",
"DTM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The search function takes a query object (in this case a vector of term counts) and searches for the K most similar (least distant) items in the data (our index of documents). The \"measure\" parameter selects the distance metric: 0 for Euclidean distance, or 1 for cosine distance, computed as 1 minus the cosine similarity. The function returns the indices of the K nearest neighbors, together with the distances from the query to every item in the data."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def knn_search(x, D, K, measure):\n",
"    \"\"\"Find the K nearest neighbors of an instance x among the instances in D.\n",
"\n",
"    measure=0 uses Euclidean distance; measure=1 uses cosine distance (1 - cosine similarity).\n",
"    Returns the indices of the K nearest neighbors and the distances from x to every instance in D.\n",
"    \"\"\"\n",
"    if measure == 0:\n",
"        # Euclidean distance from x to every instance in D\n",
"        dists = np.sqrt(((D - x)**2).sum(axis=1))\n",
"    elif measure == 1:\n",
"        # First find the vector norm of each instance in D, as well as the norm of x\n",
"        D_norm = np.linalg.norm(D, axis=1)\n",
"        x_norm = np.linalg.norm(x)\n",
"        # Cosine similarity: the dot product of x and each instance in D, divided by the product of the two norms\n",
"        sims = np.dot(D, x) / (D_norm * x_norm)\n",
"        # Use 1 - cosine similarity as the distance\n",
"        dists = 1 - sims\n",
"    else:\n",
"        raise ValueError(\"measure must be 0 (Euclidean) or 1 (cosine)\")\n",
"    idx = np.argsort(dists)  # indices that sort the distances in ascending order\n",
"    # Return the indices of the K nearest neighbors and the distances to all instances\n",
"    return idx[:K], dists"
]
},
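{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### As a quick sanity check, the sketch below applies the same two distance computations to a tiny, made-up data set (the names `toy_docs` and `toy_query` are hypothetical and not part of the index above). It assumes only NumPy and is meant to illustrate how the two measures can disagree, not to replace `knn_search`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical toy data: three documents described by two term counts each\n",
"toy_docs = np.array([[1, 0],\n",
"                     [0, 1],\n",
"                     [2, 2]])\n",
"toy_query = np.array([1, 1])\n",
"\n",
"# Euclidean distance from the query to each document\n",
"euclid = np.sqrt(((toy_docs - toy_query) ** 2).sum(axis=1))\n",
"\n",
"# Cosine distance: 1 - cosine similarity\n",
"norms = np.linalg.norm(toy_docs, axis=1) * np.linalg.norm(toy_query)\n",
"cosine = 1 - toy_docs.dot(toy_query) / norms\n",
"\n",
"print(\"Euclidean distances:\", euclid)\n",
"print(\"Cosine distances:\", cosine)\n",
"print(\"Farthest by Euclidean:\", np.argmax(euclid))\n",
"print(\"Nearest by cosine:\", np.argmin(cosine))\n",
"```\n",
"\n",
"#### Document 2 points in the same direction as the query, so its cosine distance is numerically zero even though it is the farthest of the three in Euclidean terms."
]
},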
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's now try this on a new query object"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 3, 22, 0, 17, 9, 6, 1, 12, 0, 22])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.array([3, 22, 0, 17, 9, 6, 1, 12, 0, 22])\n",
"x"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Finding the k=5 nearest neighbors using cosine distance (1 - cosine similarity) as the distance metric\n",
"neigh_idx, distances = knn_search(x, DTM, 5, 1)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([11, 10, 13, 14, 12], dtype=int64)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"neigh_idx"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.669527\n",
"1 0.836397\n",
"2 0.818826\n",
"3 0.718808\n",
"4 0.692761\n",
"5 0.386637\n",
"6 0.881295\n",
"7 0.877364\n",
"8 0.603925\n",
"9 0.400426\n",
"10 0.069511\n",
"11 0.007385\n",
"12 0.194400\n",
"13 0.152276\n",
"14 0.172249\n",
"dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"distances = pd.Series(distances, index=DT.index)\n",
"distances"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query: [ 3 22 0 17 9 6 1 12 0 22]\n",
"\n",
"Neighbors:\n"
]
},
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9\n",
"11 0 14 1 12 5 4 0 7 0 16\n",
"10 4 27 0 7 14 9 0 27 0 33\n",
"13 0 2 0 3 4 5 0 4 1 7\n",
"14 0 11 0 2 0 3 0 4 0 3\n",
"12 0 3 0 21 12 0 0 5 0 14"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"Query:\", x)\n",
"print(\"\\nNeighbors:\")\n",
"DT.iloc[neigh_idx]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Finding the k=5 nearest neighbors using Euclidean distance metric\n",
"neigh_idx, distances = knn_search(x, DTM, 5, 0)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[11 10 12 14 13]\n"
]
}
],
"source": [
"print(neigh_idx)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 43.231933\n",
"1 47.476310\n",
"2 40.755368\n",
"3 37.536649\n",
"4 63.007936\n",
"5 40.062451\n",
"6 48.959167\n",
"7 42.743421\n",
"8 51.107729\n",
"9 35.651087\n",
"10 22.516660\n",
"11 13.453624\n",
"12 23.345235\n",
"13 30.364453\n",
"14 29.512709\n",
"dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"distances = pd.Series(distances, index=DT.index)\n",
"distances"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query: [ 3 22 0 17 9 6 1 12 0 22]\n",
"\n",
"Neighbors:\n"
]
},
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9\n",
"11 0 14 1 12 5 4 0 7 0 16\n",
"10 4 27 0 7 14 9 0 27 0 33\n",
"12 0 3 0 21 12 0 0 5 0 14\n",
"14 0 11 0 2 0 3 0 4 0 3\n",
"13 0 2 0 3 4 5 0 4 1 7"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"Query:\", x)\n",
"print(\"\\nNeighbors:\")\n",
"DT.iloc[neigh_idx]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
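"#### Note that the choice of distance function changed the ranking of the top neighbors (i.e., the top documents returned for the query). Both measures return the same five documents, with documents 11 and 10 ranked first and second, but cosine distance orders the remaining three as 13, 14, 12, while Euclidean distance orders them as 12, 14, 13."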
]
}
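,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### One way to see why the two rankings differ is that Euclidean distance is sensitive to document length, while cosine distance is not. The sketch below, which assumes only NumPy, copies the five neighbor documents (10 through 14) and the query from above into hypothetical arrays `docs` and `query`, scales every vector to unit length, and then ranks by Euclidean distance. For unit vectors the squared Euclidean distance equals 2(1 - cosine similarity), so the two rankings should coincide.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Rows copied from the document-term matrix above: documents 10, 11, 12, 13, 14\n",
"docs = np.array([[4, 27, 0, 7, 14, 9, 0, 27, 0, 33],\n",
"                 [0, 14, 1, 12, 5, 4, 0, 7, 0, 16],\n",
"                 [0, 3, 0, 21, 12, 0, 0, 5, 0, 14],\n",
"                 [0, 2, 0, 3, 4, 5, 0, 4, 1, 7],\n",
"                 [0, 11, 0, 2, 0, 3, 0, 4, 0, 3]])\n",
"doc_ids = np.array([10, 11, 12, 13, 14])\n",
"query = np.array([3, 22, 0, 17, 9, 6, 1, 12, 0, 22])\n",
"\n",
"# Scale every vector to unit (L2) length\n",
"docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)\n",
"query_unit = query / np.linalg.norm(query)\n",
"\n",
"# Euclidean distance between the unit-length vectors\n",
"euclid_unit = np.sqrt(((docs_unit - query_unit) ** 2).sum(axis=1))\n",
"# Cosine distance on the raw counts\n",
"cosine = 1 - docs.dot(query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))\n",
"\n",
"print(doc_ids[np.argsort(euclid_unit)])\n",
"print(doc_ids[np.argsort(cosine)])\n",
"```\n",
"\n",
"#### Both lines should list the documents in the same order as the cosine-based search above, which suggests that the difference between the two original rankings comes from vector length rather than direction."
]
}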
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}