Scraping data off arabnet startup database for a better understanding of the MENA startup scene

Posted on March 8, 2012 in Events/conferences, Tutorials

Since I am an arabnet official blogger and all – I was checking out something interesting to blog about and I ran into the startup database http://startupdb.arabnet.me and i thought it would be interesting to actually get some statistics about the start up scene and why not visualize it too.

It is really important data for business decisions and no one can fail to see its value.

This is why I scrapped it off the arabnet website using excel – no, i did not hack into their database and steal it – i scrapped it off – here is a wikipedia link to explain data scrapping and here is how you do it.

      1. Launch excel
      2. Go to Data tab
      3. Choose from web
      4. Enter the following address: http://startupdb.arabnet.me/startup/2/
      5. Click import into a new worksheet (new worksheet is important unless you want all your data squashed together)
      6. Wait a bit
      7. If you are in Lebanon, wait some more for ontornet

As you can see, the data is now in excel – now you need to automate this process from the url http://startupdb.arabnet.me/startup/2/  till max number of records – which i guessed it to be 3000

In Excel, you can do that via a Macro… now delete the sheet you just created. Click the little record macro button at the bottom of the excel workbook, just under sheets, it is a little square with a red dot in it.

      • Repeat the steps above – but while macro being recorded – once done click stop macro
      • Go to view – Macros – Edit the macro you just recorded it to contain a loop.

Here is what your macro should look like :

ActiveWorkbook.Worksheets.Add
With ActiveSheet.QueryTables.Add(Connection:= _
“URL;http://startupdb.arabnet.me/startup/2″, Destination:=Range(“$A$1″))
.Name = “2″
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlEntirePage
.WebFormatting = xlWebFormattingNone
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With

You should modify it – by adding a loop to look like this (modified parts are in red )

On error resume next — some Ids do not exist, so this is just a cheap escape loop

Dim i As Integer

For i = 2 To 10 –modify this to become 3000 for the full data, but first run a test with 10 pages to see if it works

ActiveWorkbook.Worksheets.Add
With ActiveSheet.QueryTables.Add(Connection:= _
“URL;http://startupdb.arabnet.me/startup/” & i, Destination:=Range(“$A$1″))
.Name = “2″
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlEntirePage
.WebFormatting = xlWebFormattingNone
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With

Next i

Now that you have the data – you have to

  • Clean it up from unwanted fields like headers
  • Turn every sheete into a row

Of course you can do that via macros – but it is a bit complicated for this post, but i hope I got you started on data scrapping

Downloading the Data

I took the trouble to show you how to do Scrapping for further uses – but for the busy and the lazy , you can download the Final excel sheet I worked on – it has the data turned into rows and all

This data is highly accurate – i did a Quality check for around 20 records along the list – the only fields who can have some errors are probably descriptions – I appreciate if you do another Quality check as well and let me know if any errors.

I did not have time for any further analytics – i didn’t even do simple statistics… the next step usually for me is to import into a Database management system ( MSSQL is my favorite ) – It helps clean up the data some more and here are some of the statistics i want to create :

  • average size of startups as in number of employees
  • average age of startups
  • age vs size
  • categories distribution
  • countries distribution
  • categories/countries/age/size – to compare the startup scene in different countries
  • top keywords in description to see what those startups are all about
  • blablablabla

But i am really short on time and i kinda have to blog about arabnet, so i am sharing the data with you so you can do some mining. I hope to have some free time soon for this .

I am also in desperate need for a visual person ( graphic designer ) to create infographics and visualize data – i have a few tons of data around and would love to collaborate with someone on it. If you are that visual person or someone who can finance such research – please get in touch …

Geir heik, see you in arabnet! it is getting closer by the day!

Share Button

0 Comments

Trackbacks/Pingbacks

  1. Analytics of 150 Arabnet Startup companies | Mireille Raad - [...] s.parentNode.insertBefore(po, s); })(); EmailAfter my previous post on how to scrap data off arabnet’s website and turn it into ...

Leave a Comment

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>