Files
Python/Web Data Mining/Python BeautifulSoup Web Scraping Tutorial.ipynb
2019-03-18 20:14:08 -07:00

515 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python BeautifulSoup Web Scraping Tutorial\n",
"Learn to scrape data from the web using the Python BeautifulSoup bs4 library. \n",
"BeautifulSoup makes it easy to parse useful data out of an HTML page. \n",
"First install the bs4 library on your system by running at the command line, \n",
"*pip install beautifulsoup4* or *easy_install beautifulsoup4* (or bs4) \n",
"See [BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the complete set of functions.\n",
"\n",
"### Import requests so we can fetch the html content of the webpage\n",
"You can see our example page has about 28k characters."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28556\n"
]
}
],
"source": [
"import requests\n",
"r = requests.get('https://www.usclimatedata.com/climate/united-states/us')\n",
"print(len(r.text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import BeautifulSoup, and convert your HTML into a bs4 object\n",
"Now we can access specific HTML tags on the page using dot, just like a JSON object."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<title>Climate United States - normals and averages</title>\n",
"Climate United States - normals and averages\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"soup = BeautifulSoup(r.text)\n",
"print(soup.title)\n",
"print(soup.title.string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drill into the bs4 object to access page contents\n",
"soup.p will give you the contents of the first paragraph tag on the page. \n",
"soup.a gives you anchors / links on the page. \n",
"Get contents of an attribute inside an HTML tag using square brackets and perentheses. \n",
"Use .parent to get the parent object, and .next_sibling to get the next peer object. \n",
"**Use your browser's *inspect element* feature to find the tag for the data you want.**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<p class=\"breadcrumbs\">You are here: <a href=\"/climate/united-states/us\" title=\"Climate United States\">United States</a></p>\n",
"You are here: United States\n",
"<a href=\"https://www.facebook.com/yourweatherservice\" title=\"US Climate Data on Facebook\">\n",
"<img alt=\"US Climate Data on Facebook\" height=\"16\" src=\"/images/icons/facebook.png\" width=\"16\"/>\n",
"</a>\n",
"US Climate Data on Facebook\n",
"\n",
"<ul>\n",
"<li>\n",
"<a class=\"summary\" href=\"#summary\">Monthly</a>\n",
"</li>\n",
"<!-- Breadcrumbs -->\n",
"<p class=\"breadcrumbs\">You are here: <a href=\"/climate/united-states/us\" title=\"Climate United States\">United States</a></p>\n",
"<!-- end Breadcrumbs -->\n",
"</ul>\n"
]
}
],
"source": [
"print(soup.p)\n",
"print(soup.p.text)\n",
"print(soup.a)\n",
"print(soup.a['title'])\n",
"print()\n",
"print(soup.p.parent)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prettify() is handy for formatted printing \n",
"but note this works only on bs4 objects, not on strings, dicts or lists. For those you need to import pprint."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<ul>\n",
" <li>\n",
" <a class=\"summary\" href=\"#summary\">\n",
" Monthly\n",
" </a>\n",
" </li>\n",
" <!-- Breadcrumbs -->\n",
" <p class=\"breadcrumbs\">\n",
" You are here:\n",
" <a href=\"/climate/united-states/us\" title=\"Climate United States\">\n",
" United States\n",
" </a>\n",
" </p>\n",
" <!-- end Breadcrumbs -->\n",
"</ul>\n",
"\n"
]
}
],
"source": [
"print(soup.p.parent.prettify())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### We need all the state links on this page\n",
"First we find_all anchor tags, and print out the href attribute, which is the actual link url. \n",
"But we see the result includes some links we don't want, so we need to filter those out."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://www.facebook.com/yourweatherservice\n",
"https://twitter.com/usclimatedata\n",
"http://www.usclimatedata.com\n",
"/climate/united-states/us\n",
"#summary\n",
"/climate/united-states/us\n",
"#\n",
"#\n",
"/climate/alabama/united-states/3170\n",
"/climate/kentucky/united-states/3187\n",
"/climate/north-dakota/united-states/3204\n",
"/climate/alaska/united-states/3171\n",
"/climate/louisiana/united-states/3188\n",
"/climate/ohio/united-states/3205\n",
"/climate/arizona/united-states/3172\n",
"/climate/maine/united-states/3189\n",
"/climate/oklahoma/united-states/3206\n",
"/climate/arkansas/united-states/3173\n",
"/climate/maryland/united-states/1872\n",
"/climate/oregon/united-states/3207\n",
"/climate/california/united-states/3174\n",
"/climate/massachusetts/united-states/3191\n",
"/climate/pennsylvania/united-states/3208\n",
"/climate/colorado/united-states/3175\n",
"/climate/michigan/united-states/3192\n",
"/climate/rhode-island/united-states/3209\n",
"/climate/connecticut/united-states/3176\n",
"/climate/minnesota/united-states/3193\n",
"/climate/south-carolina/united-states/3210\n",
"/climate/delaware/united-states/3177\n",
"/climate/mississippi/united-states/3194\n",
"/climate/south-dakota/united-states/3211\n",
"/climate/district-of-columbia/united-states/3178\n",
"/climate/missouri/united-states/3195\n",
"/climate/tennessee/united-states/3212\n",
"/climate/florida/united-states/3179\n",
"/climate/montana/united-states/919\n",
"/climate/texas/united-states/3213\n",
"/climate/georgia/united-states/3180\n",
"/climate/nebraska/united-states/3197\n",
"/climate/utah/united-states/3214\n",
"/climate/hawaii/united-states/3181\n",
"/climate/nevada/united-states/3198\n",
"/climate/vermont/united-states/3215\n",
"/climate/idaho/united-states/3182\n",
"/climate/new-hampshire/united-states/3199\n",
"/climate/virginia/united-states/3216\n",
"/climate/illinois/united-states/3183\n",
"/climate/new-jersey/united-states/3200\n",
"/climate/washington/united-states/3217\n",
"/climate/indiana/united-states/3184\n",
"/climate/new-mexico/united-states/3201\n",
"/climate/west-virginia/united-states/3218\n",
"/climate/iowa/united-states/3185\n",
"/climate/new-york/united-states/3202\n",
"/climate/wisconsin/united-states/3219\n",
"/climate/kansas/united-states/3186\n",
"/climate/north-carolina/united-states/3203\n",
"/climate/wyoming/united-states/3220\n",
"https://www.yourweatherservice.com\n",
"https://www.climatedata.eu\n",
"https://www.weernetwerk.nl\n",
"/about-us.php\n",
"/disclaimer.php\n",
"/cookies.php\n"
]
}
],
"source": [
"for link in soup.find_all('a'):\n",
" print(link.get('href'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter urls using string functions\n",
"We just add an *if* to check conditions, then add the good ones to a list. \n",
"In the end we get 51 state links, including Washington DC."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"51\n"
]
}
],
"source": [
"base_url = 'https://www.usclimatedata.com'\n",
"state_links = []\n",
"for link in soup.find_all('a'):\n",
" url = link.get('href')\n",
" if url and '/climate/' in url and '/climate/united-states/us' not in url:\n",
" state_links.append(url)\n",
"print(len(state_links))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test getting the data for one state\n",
"then print the title for that page."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Climate Ohio - temperature, rainfall and average\n"
]
}
],
"source": [
"r = requests.get(base_url + state_links[5])\n",
"soup = BeautifulSoup(r.text)\n",
"print(soup.title.string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The data we need is in *tr* tags\n",
"But look, there are 58 tr tags on the page, and we only want 2 of them - the *Average high* rows."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"58\n"
]
}
],
"source": [
"rows = soup.find_all('tr')\n",
"print(len(rows))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter rows, and add temp data to a list\n",
"We use a list comprehension to filter the rows. \n",
"Then we have only 2 rows left. \n",
"We iterate through those 2 rows, and add all the temps from data cells (td) into a list."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n",
"['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']\n"
]
}
],
"source": [
"rows = [row for row in rows if 'Average high' in str(row)]\n",
"print(len(rows))\n",
"\n",
"high_temps = []\n",
"for row in rows:\n",
" tds = row.find_all('td')\n",
" for i in range(1,7):\n",
" high_temps.append(tds[i].text)\n",
"print(high_temps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get the name of the State\n",
"First attempt we just split the title string into a list, and grab the second word. \n",
"But that doesn't work for 2-word states like New York and North Carolina. \n",
"So instead we slice the string from first blank to the hyphen. "
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyoming\n",
"Wyoming\n"
]
}
],
"source": [
"state = soup.title.string.split()[1]\n",
"print(state)\n",
"s = soup.title.string\n",
"state = s[s.find(' '):s.find('-')].strip()\n",
"print(state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add state name and temp list to the data dictionary\n",
"For a single state, this is what our scraped data looks like. \n",
"In this example we only got monthly highs by state, but you could drill into cities, and could get lows and precipitation. "
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']}\n"
]
}
],
"source": [
"data = {}\n",
"data[state] = high_temps\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Put it all together and iterate 51 states\n",
"We loop through our 51-state list, and get high temp data for each state, and add it to the data dict. \n",
"This combines all our work above into a single for loop. \n",
"The result is a dict with 51 states and a list of monthly highs for each."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Alabama': ['57', '62', '70', '77', '84', '90', '92', '92', '87', '78', '69', '60'], 'Kentucky': ['40', '45', '55', '66', '75', '83', '87', '86', '79', '68', '55', '44'], 'North Dakota': ['23', '28', '40', '57', '68', '77', '85', '83', '72', '58', '40', '26'], 'Alaska': ['23', '27', '34', '44', '56', '63', '65', '64', '55', '40', '28', '25'], 'Louisiana': ['62', '65', '72', '78', '85', '89', '91', '91', '87', '80', '72', '64'], 'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41'], 'Arizona': ['67', '71', '77', '85', '95', '104', '106', '104', '100', '89', '76', '66'], 'Maine': ['28', '32', '40', '53', '65', '74', '79', '78', '70', '57', '45', '33'], 'Oklahoma': ['50', '55', '63', '72', '80', '88', '94', '93', '85', '73', '62', '51'], 'Arkansas': ['51', '55', '64', '73', '81', '89', '92', '93', '86', '75', '63', '52'], 'Maryland': ['42', '46', '54', '65', '75', '85', '89', '87', '80', '68', '58', '46'], 'Oregon': ['48', '52', '56', '61', '68', '74', '82', '82', '77', '64', '53', '46'], 'California': ['54', '60', '65', '71', '80', '87', '92', '91', '87', '78', '64', '54'], 'Massachusetts': ['36', '39', '45', '56', '66', '76', '81', '80', '72', '61', '51', '41'], 'Pennsylvania': ['40', '44', '53', '64', '74', '83', '87', '85', '78', '67', '56', '45'], 'Colorado': ['45', '46', '54', '61', '72', '82', '90', '88', '79', '66', '52', '45'], 'Michigan': ['30', '33', '44', '58', '69', '78', '82', '80', '73', '60', '47', '34'], 'Rhode Island': ['37', '40', '48', '59', '68', '78', '83', '81', '74', '63', '53', '42'], 'Connecticut': ['37', '40', '47', '58', '68', '77', '82', '81', '74', '63', '53', '42'], 'Minnesota': ['26', '31', '43', '58', '71', '80', '85', '82', '73', '59', '42', '29'], 'South Carolina': ['56', '60', '68', '76', '84', '90', '93', '91', '85', '76', '67', '58'], 'Delaware': ['43', '47', '55', '66', '75', '83', '87', '85', '79', '69', '58', '47'], 'Mississippi': ['56', '60', '69', '76', '83', '89', '92', '92', '87', '77', '67', '58'], 'South Dakota': ['22', '27', '39', '57', '69', '78', '84', '82', '72', '58', '39', '25'], 'District of Columbia': ['42', '44', '53', '64', '75', '83', '87', '84', '78', '67', '55', '45'], 'Missouri': ['40', '45', '56', '67', '75', '83', '88', '88', '80', '69', '56', '43'], 'Tennessee': ['50', '55', '64', '73', '81', '89', '92', '91', '85', '74', '63', '52'], 'Florida': ['64', '67', '74', '80', '87', '91', '92', '92', '88', '81', '73', '65'], 'Montana': ['33', '39', '48', '58', '67', '76', '86', '85', '73', '59', '43', '32'], 'Texas': ['62', '65', '72', '80', '87', '92', '96', '97', '91', '82', '71', '63'], 'Georgia': ['52', '57', '64', '72', '81', '86', '90', '88', '82', '73', '64', '54'], 'Nebraska': ['32', '37', '50', '63', '73', '84', '88', '86', '77', '64', '48', '36'], 'Utah': ['38', '44', '53', '61', '71', '82', '90', '89', '78', '65', '50', '40'], 'Hawaii': ['80', '80', '81', '83', '85', '87', '88', '89', '89', '87', '84', '81'], 'Nevada': ['45', '50', '57', '63', '71', '81', '90', '88', '80', '68', '54', '45'], 'Vermont': ['27', '31', '40', '55', '67', '76', '81', '79', '70', '57', '46', '33'], 'Idaho': ['38', '45', '55', '62', '72', '81', '91', '90', '79', '65', '48', '38'], 'New Hampshire': ['31', '35', '44', '57', '69', '77', '82', '81', '73', '60', '48', '36'], 'Virginia': ['47', '51', '60', '70', '78', '86', '90', '88', '81', '71', '61', '51'], 'Illinois': ['32', '36', '46', '59', '70', '81', '84', '82', '75', '63', '48', '36'], 'New Jersey': ['39', '42', '51', '62', '72', '82', '86', '84', '77', '65', '55', '44'], 'Washington': ['47', '50', '54', '58', '65', '70', '76', '76', '71', '60', '51', '46'], 'Indiana': ['35', '40', '51', '63', '73', '82', '85', '83', '77', '65', '52', '39'], 'New Mexico': ['44', '48', '56', '65', '74', '83', '86', '83', '78', '67', '53', '43'], 'West Virginia': ['42', '47', '56', '68', '75', '82', '85', '84', '78', '68', '57', '46'], 'Iowa': ['31', '36', '49', '62', '72', '82', '86', '84', '76', '63', '48', '34'], 'New York': ['39', '42', '50', '60', '71', '79', '85', '83', '76', '65', '54', '44'], 'Wisconsin': ['29', '33', '42', '54', '65', '75', '80', '78', '71', '59', '46', '33'], 'Kansas': ['40', '45', '56', '67', '76', '85', '89', '89', '80', '68', '55', '42'], 'North Carolina': ['51', '55', '63', '72', '79', '86', '89', '87', '81', '72', '62', '53'], 'Wyoming': ['40', '40', '47', '55', '65', '75', '83', '81', '72', '59', '47', '38']}\n"
]
}
],
"source": [
"data = {}\n",
"for state_link in state_links:\n",
" url = base_url + state_link\n",
" r = requests.get(base_url + state_link)\n",
" soup = BeautifulSoup(r.text)\n",
" rows = soup.find_all('tr')\n",
" rows = [row for row in rows if 'Average high' in str(row)]\n",
" high_temps = []\n",
" for row in rows:\n",
" tds = row.find_all('td')\n",
" for i in range(1,7):\n",
" high_temps.append(tds[i].text)\n",
" s = soup.title.string\n",
" state = s[s.find(' '):s.find('-')].strip()\n",
" data[state] = high_temps\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save to CSV file\n",
"Lastly, we might want to write all this data to a CSV file. \n",
"Here's a quick easy way to do that."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"\n",
"with open('high_temps.csv','w') as f:\n",
" w = csv.writer(f)\n",
" w.writerows(data.items())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}