Hello everyone. It's been a long time since I last wrote on this blog. Now I'm back with an emerging and trending concept: Data Mining and Visualization. So let's get started.
“Data” – the most important asset for any company, since its future is built on it. Not only companies: countries, continents, and, to be precise, the entire world are running after it. In this social world, tons of data are stored every day. So what's the use if we store it and do nothing with it? There is a hidden treasure in it that can shape our future and economy for the better. In the process of finding it, many algorithms were invented that restructure this data into a meaningful format. Here is the first look.
Now I think you have an idea of what it can do. Yes, you're right: “A sculptor makes a sculpture, and in the same way a programmer makes a sculpture that has an effect on the world.” Today, if you are good at shaping data, you are worth more than a million. A company, an organization, or an individual needs someone who can predict the future over time, which in turn yields a good crop. So let's step into the world of data and the different ways of processing it.
In the present scenario, most data is unstructured. It is spread all over the world, and as a data scientist you need to gather this unstructured information and turn it into meaningful, structured information. How do we do this? That is the big question that might hit our brains. To answer it, think about your brain, which has been storing information since your childhood. Did you ever design a schema for your brain to store information, or define any relationships between pieces of information? Certainly the answer is “No”. But then how do you retrieve that information when required? Now let's go a little deeper. If you observe carefully, you are unknowingly running an algorithm that identifies a pattern in the available information and thereby structures it into a meaningful format. The keyword to note here is “PATTERN”. Yes, this is what makes it work. For any unstructured data, you need to identify this so-called pattern and make the outcome possible.
There are different ways to identify these patterns, and we will discuss them in the coming days.
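As a toy illustration of finding a pattern, suppose we have free-text lines that secretly share a repeating structure. A regular expression capturing that structure turns the text into structured records. This is just a sketch: the lines, field names, and format below are all hypothetical.

```python
import re

# Hypothetical unstructured lines: free text that hides a repeating pattern
notes = [
    "Ordered 3 apples for Rs. 45 on 2014-01-12",
    "Ordered 10 mangoes for Rs. 230 on 2014-02-03",
]

# The pattern: quantity, item, price, date
pattern = re.compile(
    r"Ordered (\d+) (\w+) for Rs\. (\d+) on (\d{4}-\d{2}-\d{2})")

records = []
for line in notes:
    match = pattern.match(line)
    if match:
        qty, item, price, date = match.groups()
        records.append({'item': item, 'qty': int(qty),
                        'price': int(price), 'date': date})

print(records)
```

Once the pattern is identified, the unstructured text becomes a list of uniform records that any algorithm can work with.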
There are two types of data:
1. Structured Data.
2. Unstructured Data.
First, let's discuss structured data. Data that resides within a record or a file is called structured data. It includes data contained in databases and spreadsheets. When analyzing structured data, you know what to do before you even start: in simple terms, the question is well defined before you attempt it, and your goal is to get the answer. For example, suppose a grocery store owner asks you to predict the time at which sales of a particular item are highest. The task is clear: given the data set, analyze the data with your algorithm and predict the required output. Here you know “what to do” up front, and your task is “how to do” it. Just as simple as that!
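The grocery store example might look like this in miniature. The hourly sales figures below are made up, and the code is a sketch written for Python 3 (adapt the `io` handling if you are on Python 2.7):

```python
import csv
import io
from collections import defaultdict

# Hypothetical structured data: hourly sales of one item, as CSV
data = io.StringIO(
    "hour,units_sold\n"
    "09,12\n10,30\n11,55\n12,80\n13,75\n14,40\n"
)

# Total up the units sold in each hour
totals = defaultdict(int)
for row in csv.DictReader(data):
    totals[row['hour']] += int(row['units_sold'])

# The hour with the highest total answers the owner's question
peak_hour = max(totals, key=totals.get)
print(peak_hour)
```

The schema (hour, units sold) is known before we start, so the whole job reduces to “how to do it”: aggregate, then take the maximum.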
Now, what are the steps involved in this? It's quite easy:
1. Scrape the data.
2. Analyze the data.
3. Visualize the data in required format.
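The three steps above can be sketched end to end on a tiny, made-up page. This uses only the standard library's `html.parser` (Python 3) so it runs anywhere; the page contents and the crude console “chart” are hypothetical stand-ins for real scraping and plotting tools.

```python
from html.parser import HTMLParser

# Hypothetical page: monthly figures embedded in list items
page = "<ul><li>Jan: 10</li><li>Feb: 25</li><li>Mar: 40</li></ul>"

# Step 1: scrape - pull the text out of each <li>
class ListScraper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.items = []
        self._in_li = False
    def handle_starttag(self, tag, attrs):
        self._in_li = (tag == 'li')
    def handle_data(self, data):
        if self._in_li:
            self.items.append(data)
    def handle_endtag(self, tag):
        if tag == 'li':
            self._in_li = False

scraper = ListScraper()
scraper.feed(page)

# Step 2: analyze - turn the scraped text into (label, number) pairs
values = [(item.split(':')[0], int(item.split(':')[1]))
          for item in scraper.items]

# Step 3: visualize - a quick text bar chart on the console
for month, value in values:
    print(month, '#' * (value // 5))
```

Real projects swap each step for heavier machinery (a scraping library, a statistics or machine-learning step, a plotting library), but the pipeline shape stays the same.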
There are a lot of tools to do this. I prefer Python, since it has vast libraries compared to the others.
Initially, download Python 2.7.3 from here. This is the standard and preferred version.
Now, suppose you are given a URL and asked to analyze the information and predict an outcome. The first step is to scrape the data from the URL. Below I provide sample code for a particular URL; use it as a model to download the URL of your choice. For this program you need to install lxml, a library that efficiently processes XML as well as HTML, even if the markup is badly formatted, and returns the structure as a tree.
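To see what lxml buys you before reading the full program, here is a small sketch. The HTML string is deliberately malformed (an unclosed `<li>` and `<a>`), yet lxml still builds a tree we can query with the same `.//li/a` expression the scraper uses. The snippet and its contents are hypothetical.

```python
from lxml import etree

# Deliberately malformed HTML: the first <li> and <a> are never closed
broken = ("<html><body><ul>"
          "<li><a href='ap.html'>Andhra Pradesh"
          "<li><a href='kl.html'>Kerala</a></ul>")

# lxml repairs the markup and returns a proper element tree
tree = etree.HTML(broken)

# The same query the scraper below relies on: anchors inside list items
links = [(a.text, a.get('href')) for a in tree.findall('.//li/a')]
print(links)
```

A stricter XML parser would reject this input outright; lxml's HTML parser recovers the intended structure instead.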
The program below scrapes rainfall data from the IMD site:
– The main URL has a list of states, each an <a> tag inside a list item.
– Each state page has a list of districts, each an <a> tag inside a list item.
– Each district links to a text file that looks like this:
YEAR JANUARY FEBRUARY MARCH APRIL MAY JUNE JULY AUGUST SEPTEMBER OCTOBER NOVEMBER DECEMBER
R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP. R/F %DEP.
2006 0.0 -100 0.0 -100 89.8 755 54.3 296 27.5 23 93.9 -39 182.2 -31 450.8 98 503.8 194 15.6 -80 93.1 406 0.0 -100
2007 0.0 -100 2.0 -60 0.1 -99 12.4 -9 33.2 48 183.8 19 128.3 -52 225.9
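Notice that these text files are fixed-width rather than comma- or tab-delimited, so the key trick in the program is to locate the column boundaries from the header line and then slice every data row at those positions. A minimal sketch of the idea, using a shortened, hypothetical header:

```python
# A shortened, hypothetical fixed-width header and one data row
header = "YEAR JAN   FEB   MAR  "
row    = "2006  0.0  89.8  54.3 "

# A column starts wherever a non-space follows a space in the header
pos = [0]
for i in range(1, len(header)):
    if header[i] != ' ' and header[i - 1] == ' ':
        pos.append(i)
pos.append(len(header))

# Slice the data row at those boundaries and strip the padding
values = [row[pos[i]:pos[i + 1]].strip() for i in range(len(pos) - 1)]
print(values)
```

The full program applies the same idea, deriving the boundaries from the line containing `R/F %DEP.`.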
from lxml import etree
import csv
import sys
import urllib2
import urlparse


def get(url):
    '''Retrieves a URL'''
    sys.stderr.write(url + '\n')
    return urllib2.urlopen(url).read()


def safe(line):
    '''Strips out all unsafe / non-ASCII characters'''
    return ''.join(char for char in line if 32 <= ord(char) < 128)


# The output will be written to a CSV file with these fields:
# state, district, year, then rainfall (R/F) and departure (%DEP.)
# columns for each month. You could output to JSON too, if you wish
out = csv.writer(sys.stdout, lineterminator='\n')
months = ('jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec')
header = ('state', 'district', 'year')
for month in months:
    header += (month, month + '_dep')
out.writerow(header)

# Download the main URL
url = "http://www.imd.gov.in/section/hydro/distrainfall/districtrain.html"
tree = etree.HTML(get(url))

# Each state is an <a> inside an <li>. We loop through those...
for state in tree.findall('.//li/a'):
    state_url = urlparse.urljoin(url, state.get('href'))
    tree2 = etree.HTML(get(state_url))
    # Each district is an <a> inside an <li>. We loop through those
    for district in tree2.findall('.//li/a'):
        district_url = urlparse.urljoin(state_url, district.get('href'))
        text = get(district_url)
        pos = []
        # Process each line in the text
        for line in text.split('\n'):
            # Find the column boundaries from the 'R/F %DEP.' header line:
            # a column ends where a space follows a non-space character
            if line.find('R/F %DEP.') >= 0:
                pos = [0, 4]
                for i, char in enumerate(line):
                    if ord(char) <= 32 and i > 0 and ord(line[i - 1]) > 32:
                        pos.append(i)
                pos.append(len(line))
            # Process only the lines that start with a year: 2006, 2007, etc.
            line = safe(line.strip())
            if not line.startswith('20') or not pos:
                continue
            values = [line[pos[i]:pos[i + 1]].strip()
                      for i, p in enumerate(pos[:-1])]
            out.writerow([state.text, district.text] + values)
The above program scrapes the data from the URL and prints the scraped data to the console. Trace through the program completely and try to understand the scraping step carefully.
Note: Instead of passing sys.stdout to csv.writer, you can open a file directly using the open("filename", mode) function available in Python.
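For instance, to send the CSV rows to a file instead of the console (a sketch; the filename and rows are hypothetical, and on Python 3 you would normally also pass newline='' to open, or 'wb' mode on Python 2.7):

```python
import csv

# Hypothetical rows in the same shape the scraper produces
rows = [
    ('state', 'district', 'year'),
    ('Andhra Pradesh', 'Guntur', '2006'),
]

# Write to rainfall.csv instead of sys.stdout
with open('rainfall.csv', 'w') as f:
    out = csv.writer(f, lineterminator='\n')
    for row in rows:
        out.writerow(row)
```

Everything else in the program stays the same; only the destination of the writer changes.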