{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python BeautifulSoup Web Scraping Tutorial\n",
"Learn to scrape data from the web using the Python BeautifulSoup (bs4) library. \n",
"BeautifulSoup makes it easy to parse useful data out of an HTML page. \n",
"First install the bs4 library on your system by running one of the following at the command line: \n",
"*pip install beautifulsoup4* or *easy_install beautifulsoup4* (or bs4) \n",
"See the [BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the complete set of functions.\n",
"\n",
"### Import requests so we can fetch the HTML content of the webpage\n",
"You can see our example page has about 28k characters."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28556\n"
]
}
],
"source": [
"import requests\n",
"r = requests.get('https://www.usclimatedata.com/climate/united-states/us')\n",
"print(len(r.text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import BeautifulSoup, and convert your HTML into a bs4 object\n",
"We pass 'html.parser' explicitly so BeautifulSoup doesn't have to guess which parser to use. \n",
"Now we can access specific HTML tags on the page using dot notation, much like accessing properties of a JSON object in JavaScript."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<title>Climate United States - normals and averages</title>\n",
"Climate United States - normals and averages\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"soup = BeautifulSoup(r.text, 'html.parser')\n",
"print(soup.title)\n",
"print(soup.title.string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drill into the bs4 object to access page contents\n",
"soup.p will give you the contents of the first paragraph tag on the page. \n",
"soup.a gives you the first anchor (link) on the page. \n",
"Get the value of an attribute inside an HTML tag using square brackets, e.g. soup.a['title']. \n",
"Use .parent to get the parent object, and .next_sibling to get the next peer object. \n",
"**Use your browser's *inspect element* feature to find the tag for the data you want.**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<p class=\"breadcrumbs\">You are here: <a href=\"/climate/united-states/us\" title=\"Climate United States\">United States</a></p>\n",
"You are here: United States\n",
"<a href=\"https://www.facebook.com/yourweatherservice\" title=\"US Climate Data on Facebook\">\n",
"<img alt=\"US Climate Data on Facebook\" height=\"16\" src=\"/images/icons/facebook.png\" width=\"16\"/>\n",
"</a>\n",
"US Climate Data on Facebook\n",
"\n",
"<ul>\n",
"<li>\n",
"<a class=\"summary\" href=\"#summary\">Monthly</a>\n",
"</li>\n",
"<!-- Breadcrumbs -->\n",
"<p class=\"breadcrumbs\">You are here: <a href=\"/climate/united-states/us\" title=\"Climate United States\">United States</a></p>\n",
"<!-- end Breadcrumbs -->\n",
"</ul>\n"
]
}
],
"source": [
"print(soup.p)\n",
"print(soup.p.text)\n",
"print(soup.a)\n",
"print(soup.a['title'])\n",
"print()\n",
"print(soup.p.parent)"
]
},
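{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above used .parent but not .next_sibling, so here is a small optional sketch of sibling navigation (not from the original tutorial; the exact output depends on the page's current HTML). find_next_sibling() skips the whitespace text nodes that .next_sibling often returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# .next_sibling can be a plain whitespace string between tags, so repr() makes that visible\n",
"print(repr(soup.p.next_sibling))\n",
"# find_next_sibling() jumps to the next sibling *tag*, which is usually what you want\n",
"print(soup.p.find_next_sibling())"
]
},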
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### prettify() is handy for formatted printing \n",
"but note this works only on bs4 objects, not on strings, dicts or lists. For those you need the pprint module (see the example after the next cell)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<ul>\n",
" <li>\n",
"  <a class=\"summary\" href=\"#summary\">\n",
"   Monthly\n",
"  </a>\n",
" </li>\n",
" <!-- Breadcrumbs -->\n",
" <p class=\"breadcrumbs\">\n",
"  You are here:\n",
"  <a href=\"/climate/united-states/us\" title=\"Climate United States\">\n",
"   United States\n",
"  </a>\n",
" </p>\n",
" <!-- end Breadcrumbs -->\n",
"</ul>\n",
"\n"
]
}
],
"source": [
"print(soup.p.parent.prettify())"
]
},
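{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside (not part of the original tutorial), here is a minimal sketch of pprint for plain Python structures such as the dict we build later. pprint.pprint() wraps and indents nested containers, much as prettify() does for HTML."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"# a small sample dict, shaped like the data we will scrape below\n",
"sample = {'Ohio': ['36', '40', '52'], 'Hawaii': ['80', '80', '81']}\n",
"pprint(sample, width=40)"
]
},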
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### We need all the state links on this page\n",
"First we find_all the anchor tags and print each one's href attribute, which is the actual link URL. \n",
"But we see the result includes some links we don't want, so we need to filter those out."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://www.facebook.com/yourweatherservice\n",
"https://twitter.com/usclimatedata\n",
"http://www.usclimatedata.com\n",
"/climate/united-states/us\n",
"#summary\n",
"/climate/united-states/us\n",
"#\n",
"#\n",
"/climate/alabama/united-states/3170\n",
"/climate/kentucky/united-states/3187\n",
"/climate/north-dakota/united-states/3204\n",
"/climate/alaska/united-states/3171\n",
"/climate/louisiana/united-states/3188\n",
"/climate/ohio/united-states/3205\n",
"/climate/arizona/united-states/3172\n",
"/climate/maine/united-states/3189\n",
"/climate/oklahoma/united-states/3206\n",
"/climate/arkansas/united-states/3173\n",
"/climate/maryland/united-states/1872\n",
"/climate/oregon/united-states/3207\n",
"/climate/california/united-states/3174\n",
"/climate/massachusetts/united-states/3191\n",
"/climate/pennsylvania/united-states/3208\n",
"/climate/colorado/united-states/3175\n",
"/climate/michigan/united-states/3192\n",
"/climate/rhode-island/united-states/3209\n",
"/climate/connecticut/united-states/3176\n",
"/climate/minnesota/united-states/3193\n",
"/climate/south-carolina/united-states/3210\n",
"/climate/delaware/united-states/3177\n",
"/climate/mississippi/united-states/3194\n",
"/climate/south-dakota/united-states/3211\n",
"/climate/district-of-columbia/united-states/3178\n",
"/climate/missouri/united-states/3195\n",
"/climate/tennessee/united-states/3212\n",
"/climate/florida/united-states/3179\n",
"/climate/montana/united-states/919\n",
"/climate/texas/united-states/3213\n",
"/climate/georgia/united-states/3180\n",
"/climate/nebraska/united-states/3197\n",
"/climate/utah/united-states/3214\n",
"/climate/hawaii/united-states/3181\n",
"/climate/nevada/united-states/3198\n",
"/climate/vermont/united-states/3215\n",
"/climate/idaho/united-states/3182\n",
"/climate/new-hampshire/united-states/3199\n",
"/climate/virginia/united-states/3216\n",
"/climate/illinois/united-states/3183\n",
"/climate/new-jersey/united-states/3200\n",
"/climate/washington/united-states/3217\n",
"/climate/indiana/united-states/3184\n",
"/climate/new-mexico/united-states/3201\n",
"/climate/west-virginia/united-states/3218\n",
"/climate/iowa/united-states/3185\n",
"/climate/new-york/united-states/3202\n",
"/climate/wisconsin/united-states/3219\n",
"/climate/kansas/united-states/3186\n",
"/climate/north-carolina/united-states/3203\n",
"/climate/wyoming/united-states/3220\n",
"https://www.yourweatherservice.com\n",
"https://www.climatedata.eu\n",
"https://www.weernetwerk.nl\n",
"/about-us.php\n",
"/disclaimer.php\n",
"/cookies.php\n"
]
}
],
"source": [
"for link in soup.find_all('a'):\n",
"    print(link.get('href'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter URLs using string functions\n",
"We just add an *if* to check conditions, then append the good ones to a list. \n",
"In the end we get 51 state links, including Washington DC."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"51\n"
]
}
],
"source": [
"base_url = 'https://www.usclimatedata.com'\n",
"state_links = []\n",
"for link in soup.find_all('a'):\n",
"    url = link.get('href')\n",
"    if url and '/climate/' in url and '/climate/united-states/us' not in url:\n",
"        state_links.append(url)\n",
"print(len(state_links))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test getting the data for one state\n",
"We fetch one of the state pages and parse it, then print the title for that page."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Climate Ohio - temperature, rainfall and average\n"
]
}
],
"source": [
"r = requests.get(base_url + state_links[5])\n",
"soup = BeautifulSoup(r.text, 'html.parser')\n",
"print(soup.title.string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The data we need is in *tr* tags\n",
"But look, there are 58 tr tags on the page, and we only want 2 of them - the *Average high* rows."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"58\n"
]
}
],
"source": [
"rows = soup.find_all('tr')\n",
"print(len(rows))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filter rows, and add temp data to a list\n",
"We use a list comprehension to keep only the rows that contain *Average high*. \n",
"That leaves just 2 rows, because the page splits the year into two tables, January-June and July-December. \n",
"We iterate through those 2 rows and append the temps from the data cells (td) to a list; in each row, cell 0 is the row label, so the six monthly values are in cells 1 through 6."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n",
"['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']\n"
]
}
],
"source": [
"rows = [row for row in rows if 'Average high' in str(row)]\n",
"print(len(rows))\n",
"\n",
"high_temps = []\n",
"for row in rows:\n",
"    tds = row.find_all('td')\n",
"    for i in range(1,7):  # cells 1-6 hold the six monthly values; cell 0 is the row label\n",
"        high_temps.append(tds[i].text)\n",
"print(high_temps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get the name of the State\n",
"On a first attempt we just split the title string into a list and grab the second word. \n",
"But that doesn't work for two-word states like New York and North Carolina. \n",
"So instead we slice the string from the first blank to the hyphen. "
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyoming\n",
"Wyoming\n"
]
}
],
"source": [
"state = soup.title.string.split()[1]\n",
"print(state)\n",
"s = soup.title.string\n",
"state = s[s.find(' '):s.find('-')].strip()\n",
"print(state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add state name and temp list to the data dictionary\n",
"For a single state, this is what our scraped data looks like. \n",
"In this example we only grab monthly highs by state, but you could drill down into individual cities, and you could also pull lows and precipitation. "
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41']}\n"
]
}
],
"source": [
"data = {}\n",
"data[state] = high_temps\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Put it all together and iterate 51 states\n",
"We loop through our list of 51 state links, get the high-temperature data for each state, and add it to the data dict. \n",
"This combines all our work above into a single for loop. \n",
"The result is a dict with 51 states and a list of 12 monthly highs for each."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Alabama': ['57', '62', '70', '77', '84', '90', '92', '92', '87', '78', '69', '60'], 'Kentucky': ['40', '45', '55', '66', '75', '83', '87', '86', '79', '68', '55', '44'], 'North Dakota': ['23', '28', '40', '57', '68', '77', '85', '83', '72', '58', '40', '26'], 'Alaska': ['23', '27', '34', '44', '56', '63', '65', '64', '55', '40', '28', '25'], 'Louisiana': ['62', '65', '72', '78', '85', '89', '91', '91', '87', '80', '72', '64'], 'Ohio': ['36', '40', '52', '63', '73', '82', '85', '84', '77', '65', '52', '41'], 'Arizona': ['67', '71', '77', '85', '95', '104', '106', '104', '100', '89', '76', '66'], 'Maine': ['28', '32', '40', '53', '65', '74', '79', '78', '70', '57', '45', '33'], 'Oklahoma': ['50', '55', '63', '72', '80', '88', '94', '93', '85', '73', '62', '51'], 'Arkansas': ['51', '55', '64', '73', '81', '89', '92', '93', '86', '75', '63', '52'], 'Maryland': ['42', '46', '54', '65', '75', '85', '89', '87', '80', '68', '58', '46'], 'Oregon': ['48', '52', '56', '61', '68', '74', '82', '82', '77', '64', '53', '46'], 'California': ['54', '60', '65', '71', '80', '87', '92', '91', '87', '78', '64', '54'], 'Massachusetts': ['36', '39', '45', '56', '66', '76', '81', '80', '72', '61', '51', '41'], 'Pennsylvania': ['40', '44', '53', '64', '74', '83', '87', '85', '78', '67', '56', '45'], 'Colorado': ['45', '46', '54', '61', '72', '82', '90', '88', '79', '66', '52', '45'], 'Michigan': ['30', '33', '44', '58', '69', '78', '82', '80', '73', '60', '47', '34'], 'Rhode Island': ['37', '40', '48', '59', '68', '78', '83', '81', '74', '63', '53', '42'], 'Connecticut': ['37', '40', '47', '58', '68', '77', '82', '81', '74', '63', '53', '42'], 'Minnesota': ['26', '31', '43', '58', '71', '80', '85', '82', '73', '59', '42', '29'], 'South Carolina': ['56', '60', '68', '76', '84', '90', '93', '91', '85', '76', '67', '58'], 'Delaware': ['43', '47', '55', '66', '75', '83', '87', '85', '79', '69', '58', '47'], 'Mississippi': ['56', '60', '69', '76', '83', '89', '92', '92', '87', '77', '67', '58'], 'South Dakota': ['22', '27', '39', '57', '69', '78', '84', '82', '72', '58', '39', '25'], 'District of Columbia': ['42', '44', '53', '64', '75', '83', '87', '84', '78', '67', '55', '45'], 'Missouri': ['40', '45', '56', '67', '75', '83', '88', '88', '80', '69', '56', '43'], 'Tennessee': ['50', '55', '64', '73', '81', '89', '92', '91', '85', '74', '63', '52'], 'Florida': ['64', '67', '74', '80', '87', '91', '92', '92', '88', '81', '73', '65'], 'Montana': ['33', '39', '48', '58', '67', '76', '86', '85', '73', '59', '43', '32'], 'Texas': ['62', '65', '72', '80', '87', '92', '96', '97', '91', '82', '71', '63'], 'Georgia': ['52', '57', '64', '72', '81', '86', '90', '88', '82', '73', '64', '54'], 'Nebraska': ['32', '37', '50', '63', '73', '84', '88', '86', '77', '64', '48', '36'], 'Utah': ['38', '44', '53', '61', '71', '82', '90', '89', '78', '65', '50', '40'], 'Hawaii': ['80', '80', '81', '83', '85', '87', '88', '89', '89', '87', '84', '81'], 'Nevada': ['45', '50', '57', '63', '71', '81', '90', '88', '80', '68', '54', '45'], 'Vermont': ['27', '31', '40', '55', '67', '76', '81', '79', '70', '57', '46', '33'], 'Idaho': ['38', '45', '55', '62', '72', '81', '91', '90', '79', '65', '48', '38'], 'New Hampshire': ['31', '35', '44', '57', '69', '77', '82', '81', '73', '60', '48', '36'], 'Virginia': ['47', '51', '60', '70', '78', '86', '90', '88', '81', '71', '61', '51'], 'Illinois': ['32', '36', '46', '59', '70', '81', '84', '82', '75', '63', '48', '36'], 'New Jersey': ['39', '42', '51', '62', '72', '82', '86', '84', '77', '65', '55', '44'], 'Washington': ['47', '50', '54', '58', '65', '70', '76', '76', '71', '60', '51', '46'], 'Indiana': ['35', '40', '51', '63', '73', '82', '85', '83', '77', '65', '52', '39'], 'New Mexico': ['44', '48', '56', '65', '74', '83', '86', '83', '78', '67', '53', '43'], 'West Virginia': ['42', '47', '56', '68', '75', '82', '85', '84', '78', '68', '57', '46'], 'Iowa': ['31', '36', '49', '62', '72', '82', '86', '84', '76', '63', '48', '34'], 'New York': ['39', '42', '50', '60', '71', '79', '85', '83', '76', '65', '54', '44'], 'Wisconsin'"
]
}
],
"source": [
"data = {}\n",
"for state_link in state_links:\n",
"    url = base_url + state_link\n",
"    r = requests.get(url)\n",
"    soup = BeautifulSoup(r.text, 'html.parser')\n",
"    rows = soup.find_all('tr')\n",
"    rows = [row for row in rows if 'Average high' in str(row)]\n",
"    high_temps = []\n",
"    for row in rows:\n",
"        tds = row.find_all('td')\n",
"        for i in range(1,7):  # cells 1-6 hold the six monthly values; cell 0 is the row label\n",
"            high_temps.append(tds[i].text)\n",
"    s = soup.title.string\n",
"    state = s[s.find(' '):s.find('-')].strip()\n",
"    data[state] = high_temps\n",
"print(data)"
]
},
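{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scraped temps are strings. As an optional aside (not part of the original tutorial), here is a minimal sketch that converts them to integers in a separate dict, so they are ready for sorting or math; the data_int name is just illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build a parallel dict with int values, leaving the scraped data dict untouched\n",
"data_int = {state: [int(t) for t in temps] for state, temps in data.items()}\n",
"print(data_int['Ohio'])"
]
},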
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save to CSV file\n",
"Lastly, we might want to write all this data to a CSV file. \n",
"Here's a quick, easy way to do that."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"\n",
"# newline='' keeps the csv module from writing extra blank lines on Windows\n",
"with open('high_temps.csv', 'w', newline='') as f:\n",
"    w = csv.writer(f)\n",
"    # one row per state: the state name followed by its twelve monthly highs\n",
"    w.writerows([state] + temps for state, temps in data.items())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}