Web data extraction for your small business. (Data Science Summit, Kolkata)

Anirban

11 years ago

Post written by: Dipayan Dev

Do you aspire to become an entrepreneur? Does starting up look challenging? It might, but a well-structured plan would make your task easier. Often Plan A does not work, and Plan B has to be executed to reach your targets on time.

Performing beyond our capacity to cope with the ever-rising competition is in the DNA of successful entrepreneurs. Using the best of technologies, like Web Data Extraction, to beat the competition is a part of the game plan.

Web Data Extraction or Web Crawling: It is the process of fetching information from a web page. Web crawling is closely tied to Big Data: a crawler, or bot, is set up to extract data from deep within a web page. The crawler should also be able to extract fields that, at times, do not reside in the page source (for example, content that scripts load after the page is fetched).

You can create your own crawler with an automated extraction engine, i.e. a wrapper generator, or you can run your extraction process manually using traditional wrapper induction algorithms. One of the basic elements of extracting data from the web is XPath. Here is a link that explains a traditional method used to crawl any HTML or XML file.
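To make XPath a little more concrete, here is a minimal sketch in Python. It is only an illustration: the markup, the class names and the `extract_products` function are all made up for this post, and it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` module, which only works on well-formed markup. Real pages are messier and usually need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing (not taken from any real site).
PAGE = """
<html>
  <body>
    <div class="product">
      <h2 class="name">Blue Kettle</h2>
      <span class="price">1499</span>
    </div>
    <div class="product">
      <h2 class="name">Steel Flask</h2>
      <span class="price">799</span>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Return one dict per product block found in the markup."""
    root = ET.fromstring(markup.strip())
    products = []
    # ".//div[@class='product']" selects every <div class="product">
    # anywhere below the root element.
    for node in root.findall(".//div[@class='product']"):
        # These paths are relative to the current product <div>.
        name = node.findtext("h2[@class='name']")
        price = node.findtext("span[@class='price']")
        products.append({"name": name, "price": int(price)})
    return products


if __name__ == "__main__":
    for product in extract_products(PAGE):
        print(product)
```

Each XPath expression simply describes where a field lives in the tag tree, which is why a change in the site's structure usually means the expressions have to change too.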

Are we getting too technical here?

Let’s continue.

Never in the history of mankind have we had so much data recorded and stored in an accessible manner. I’ll be the geek for some time now.

Do you know: 2.4 million Facebook posts are made per minute (2014 stats)?

And that’s just Facebook. Every minute, 4 million Google searches are made, 204 million emails are sent and 2.8 lakh (280,000) tweets are posted!

The size of the World Wide Web (WWW) has reached the exabyte scale in the last couple of years. But most of the data is incoherent and thus tough to access. The bulk of it is spread across individual blog sites, news portals, ecommerce sites and other portals. Each of these sites has its own structure and its own unique HTML tag tree. This makes building an automated extraction system slightly tough. Not impossible though.

We have “wrappers” to perform such automatic extraction of data. A wrapper is defined as a procedure that translates content from a specific information source into a relational model, i.e. it converts unstructured data into a structured format. Am I being too geeky?

As we said, we could write a wrapper by hand for a single website. But since every website has its own structure, doing this for each site is inconvenient, so we need a tool that creates wrappers for us. Such tools are called “wrapper generators”.
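For the curious, a hand-written wrapper for one site could look roughly like the sketch below. Everything in it is an assumption made for illustration: the markup, the `posts` table and the `wrap` function are hypothetical, and it relies only on Python's standard `xml.etree.ElementTree` and `sqlite3` modules. It shows the idea of turning tags into rows of a relational table, not how any particular crawling product works.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical markup for a single site; all names here are made up.
PAGE = """
<html><body>
  <article><h1>Crawling 101</h1><p class="author">Asha</p></article>
  <article><h1>Why data matters</h1><p class="author">Rahul</p></article>
</body></html>
"""


def wrap(markup, conn):
    """Translate one page into rows of a relational table.

    Returns the number of rows inserted.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT, author TEXT)")
    root = ET.fromstring(markup.strip())
    # One row per <article>: the tag tree becomes (title, author) tuples.
    rows = [
        (article.findtext("h1"), article.findtext("p[@class='author']"))
        for article in root.iter("article")
    ]
    conn.executemany("INSERT INTO posts VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    wrap(PAGE, conn)
    for row in conn.execute("SELECT title, author FROM posts ORDER BY title"):
        print(row)
```

Notice that the tag names are hard-coded for this one page layout. A second site with a different layout would need a second wrapper, which is exactly the chore a wrapper generator is meant to take over.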

The first thing we do with the data is to contextualise it. Then it becomes ready for use. Usage ranges from directly using the information in the data to sentiment analysis.

Most young start-ups do not have such tools. But there are various web crawling companies that can make their lives easier.

Why would a start-up need it?

a. Understand your target group:

Know your target market’s online behaviour, search patterns, buying behaviour, sentiment during interactions and contact details, and segment it on the basis of age, gender, demographics, income group, tastes, preferences etc.

b. Lower marketing cost, higher conversion rate:

Using this technology, you can filter your audience. This might also reduce your marketing effort and the cost involved, increase your conversion rate and offer you a better return on investment.

c. Understand your competitors:

Know what product pricing they are coming up with, and compare it with other portals globally. Check the quality they offer for the price, and their user feedback.

d. Better branding:

Reaching out to a relevant niche audience to start your promotion, getting feedback, building rapport and becoming the market leader in that niche before diversifying would only enhance your brand.

Web Data Extraction has definitely become an irreplaceable helping hand for a start-up. Your small steps could make your small business big.

Data Science Summit, 28 August 2015. The Park, Kolkata. For more details and RSVP, check the site.

This was an attempt to make Big Data appear slightly small, so that you take an interest in it. If you are a student interested in big data, or a businessperson reading this and wanting to know how your business could join hands with the newest technologies and increase revenue, there is something BIG for you.

Data Science Foundation, in association with NASSCOM, is organising “Data Science Summit, Kolkata 2015” on 28th August 2015 at The Park, Kolkata. Something of this sort and magnitude is happening for the first time in Kolkata, and we are happy that Kolkata Bloggers is a part of this event. We have the best speakers in the region coming together to raise awareness of Big Data. Do check out their website and block your calendar as soon as possible.

—

Author details:

Dipayan Dev works at Prompt Cloud Technologies as a software engineer. He completed his master's in computer science engineering at the National Institute of Technology, Silchar. His research interests include developing new algorithms for large-scale data, key-value stores, data management etc. Dipayan has authored various research papers published by IEEE and Springer. He was my college junior and we shared an apartment during my last year of college. [Facebook link]

Anirban

Associate Principal, Responsible AI at Accenture

I am probably an explorer, with a keen interest in contributing to wildlife conservation. You could follow me on Instagram (@sahaanirban) for more photographs.