• Blog
  • Info Support
  • Career
  • Training
  • International Group
  • Info Support
  • Blog
  • Career
  • Training
  • International Group
  • Search
logo InfoSupport
  • Latest blogs
  • Popular blogs
  • Experts
      • All
      • Bloggers
      • Speakers
  • Meet us
  • About us
    • nl
    • en
    • .NET
    • 3D printing
    • Advanced Analytics
    • Agile
    • Akka
    • Alexa
    • Algorithms
    • Api's
    • Architectuur
    • Artificial Intelligence
    • ATDD
    • Augmented Reality
    • AWS
    • Azure
    • Big Data
    • Blockchain
    • Business Intelligence
    • Chatbots
    • Cloud
    • Code Combat
    • Cognitive Services
    • Communicatie
    • Containers
    • Continuous Delivery
    • CQRS
    • Cyber Security
    • Dapr
    • Data
    • Data & Analystics
    • Data Science
    • Data Warehousing
    • Databricks
    • DataOps
    • Developers life
    • DevOps
    • Digital Days
    • Digital Twin
    • Docker
    • eHealth
    • Enterprise Architecture
    • Event Sourcing
    • Hacking
    • Infrastructure & Hosting
    • Innovatie
    • Integration
    • Internet of Things
    • Java
    • Machine Learning
    • Microservices
    • Microsoft
    • Microsoft Bot Framework
    • Microsoft Data Platform
    • Mobile Development
    • Mutation Testing
    • Open source
    • Pepper
    • Power BI
    • Privacy & Ethiek
    • Python
    • Quality Assistance & Test
    • Quality Assurance & Test
    • Requirements Management
    • Scala
    • Scratch
    • Security
    • SharePoint
    • Software Architecture
    • Software development
    • Software Factory
    • SQL Server
    • SSL
    • Start-up
    • Startup thinking
    • Stryker
    • Test Quality
    • Testing
    • TLS
    • TypeScript
    • Various
    • Web Development
    • Web-scale IT
    • Xamarin
    • All
    • Bloggers
    • Speakers
Home » Data Explorer: Loading less-structured data inside Excel
  • Data Explorer: Loading less-structured data inside Excel

    • By Koos van Strien
    • Business Intelligence 9 years ago
    • Business Intelligence 0 comments
    • Business Intelligence Business Intelligence
    Data Explorer: Loading less-structured data inside Excel

    About Data Explorer

    The free Excel-plugin Data Explorer has been around for a few months now. For those who don’t know about Data Explorer yet: Data Explorer can be described as a self-service ETL-tool: it extracts data from all kinds of sources, and loads it into Excel, while transforming it into a usable format. After loading the data into Excel, the data can be used for whatever you want to.

    The ability to load data towards Excel has been around here for years. Connections to CSV files, databases and other sources could be made from Excel within, and data could be pulled into Excel. From Excel 2010 on, we can load data into the relational PowerPivot engine. Starting with Excel 2013 the data can be loaded into the Excel Data Model. What does Data Explorer add to the options we already had?

    The options to import data into Excel without Data Explorer are via a fixed set of connectors. The connectors in Excel have the ability to connect to several types of databases, and some types of less structured data like CSV. Although there are some configuration options for those less structured data files, the options to customize are limited. Ever tried to import a CSV file with multiple header rows into Excel? (Schema.ini anyone?) Maybe a CSV file where every third row needs to be skipped?  Or only the rows that have a certain value in the first column? And what about data from a HTML file, via a real (refreshable) data connection?

    Until now, data that had to be loaded into Excel needed some kind of schema. In general, the schema was quite rigid. With Data Explorer, it’s possible (and quite easy, too) to load less structured into Excel, to transform it into a usable format, and to automate the transformation. This turns out to be tremendously powerful: Skipping lines based on content, moving the value of every second row inside a new column, joining multiple tables and all examples mentioned above are possible.

    Of course, not all this can be covered inside one blog post – so to let’s start with a simple case. In a next blogpost we’ll build further upon this example.

    Example: Let’s load some data!

    Case: Suppose we want to draw all gas stations in the Netherlands on a map in GeoFlow, like displayed in the following image.

    Tankstations voorbeeld
    (In reality, there are more gas stations in the Netherlands. It should be clear what we want though)

    Although that shouldn’t be too hard a task, after some research it seems there is no OData feed, Azure Marketplace source or another (technically easy to use) list of all gas stations in the Netherlands. Luckily, some web sites are providing a list of gas stations along with their addresses. In one case, retrieving the data would require loading several thousand pages, which may be somewhat too hard to try (try to find all addresses of gas stations at www.alletankstations.nl for example). Another site looks easier: brandstof-zoeker.nl provides an easy overview of gas stations categorized alphabetically, according to the company name of the gas station.

    Initially we focus on all gas stations starting with a ‘B’, as listed on http://www.brandstof-zoeker.nl/station/B.

    Inside a new Excel workbook, we head to the ‘Data Explorer’ tab on the ribbon, and select ‘Get external data’ -> ‘From Web’. This opens up a Data Explorer dialog, where we can enter the URL where our data is located.

    After clicking ‘OK’, Data Explorer loads the web site from the URL we provided, and gives a high-level overview of the Document Object Model (DOM). The root node of the DOM is ‘Document’. Clicking on ‘Document’ on the left shows us the contents of that node on the right.

    2. Inside Data Explorer

    Inside the table, the column ‘Children’ is visible. The DOM has a tree structure: the root node has children, each of which can have their own children. To drill down towards the children, click on the text ‘Table’ in the column ‘Children’. This opens a table containing the child elements which reside directly under the ‘Document’ level.

    3. Children 1

    In this way, it’s quite easy to find any element in a given document, which we will be able to load afterwards. When you’ve located the place of the list gas stations in the DOM (and selected it), Data Explorer looks like the following image:

    4. Source gevonden

    Click ‘Done’ now, and the data will be loaded to the Excel sheet.

    Data geladen

    At this moment, a real data source connection has been made towards a web page. This means that the data can be refreshed by the ‘Refresh All’ option inside Excel, and Data Explorer will take a look for us at the web address we point at.

    Reflection: where are we heading?

    Let’s reflect on what we’ve just done: by using Data Explorer, without any coding, we are able to retrieve data from a loose-structured source like a webpage. More than retrieving alone, we have a real, refreshable, data connection which retrieves a set of data from a website. Website updates data, we refresh source. Great, isn’t it?

    In my opinion this shows the next step in self-service Business Intelligence: It has been possible to do self-service analysis using PowerPivot for several years. Inside PowerPivot data could be integrated from several sources:

    • data warehouses
    • production databases
    • Excel files
    • CSV files
    • Sharepoint lists
    • OData feeds
    • (…)

    With Data Explorer it has become quite easy to add less-structured data, that was without Data Explorer quite hard to load (and thus to integrate). Examples of these sourcesa are

    • HTML files
    • Hadoop data
    • Active Directory users
    • Facebook data (events, timelines, friends, …)

    By the way: Facebook data integrates in a beautiful way – you really have to try it!

    Facebook graph
    (names and id’s are removed)

    Conclusion: looking forward for the next blogpost

    Of course, this is not where it stops! Although it’s certainly useful to be able to load data in this way into an Excel sheet, the data we have retrieved here is not quite ready to use, because the data is dirty:

    • There’s lots of ‘noise data’ (I’m not interested in the ‘H3’ elements, for example)
    • All useful data is in the same column, without distinguishing hints in other columns about which data it could be:
    • 3D: ‘Centraleweg’, is it a street, or maybe a town?
    • There seems to be a pattern: the second row in the table (row 3 in the sheet) is an address, and every 12th row after that a new address starts. However, this pattern stops at row 195..
    • The names of gas stations seem to be excluded. Further investigation learns that the H3 element has its own structure inside the DOM, so it’s located inside a new table.

    In the next blogpost I will make a start with the cleansing process, by extending and programming Data Explorer.

    For now, go forth and Data Explore!

    Share this

Koos van Strien

View profile

Related IT training

Go to training website

Related Consultancy solutions

Go to infosupport.com

Related blogs

  • "Garbage in, Garbage out" Vincent Lukassen - 2 years ago

  • Data Governance, een hot topic in menig organisatie

    Data Governance, een hot topic in menig organisatie Vincent Lukassen - 3 years ago

  • Foodsector laat kansen liggen door beperkt gebruik van …

    Foodsector laat kansen liggen door beperkt gebruik van … Hans Geurtsen - 4 years ago

Data Discovery Channel

  • Modern Data Platform

  • Gartner Data & Analytics Summit 2022

  • De Data Architecture ®Evolution

Nieuwsbrief

* verplichte velden

Contact

  • Head office NL
  • Kruisboog 42
  • 3905 TG Veenendaal
  • T +31 318 552020
  • Call
  • Mail
  • Directions
  • Head office BE
  • Generaal De Wittelaan 17
  • bus 30 2800 Mechelen
  • T +32 15 286370
  • Call
  • Mail
  • Directions

Follow us

  • Twitter
  • Facebook
  • Linkedin
  • Youtube

Newsletter

Sign in

Extra

  • Media Library
  • Disclaimer
  • Algemene voorwaarden
  • ISHBS Webmail
  • Extranet
Beheer cookie toestemming
Deze website maakt gebruik van Functionele en Analytische cookies voor website optimalisatie en statistieken.
Functioneel Always active
De technische opslag of toegang is strikt noodzakelijk voor het legitieme doel het gebruik mogelijk te maken van een specifieke dienst waarom de abonnee of gebruiker uitdrukkelijk heeft gevraagd, of met als enig doel de uitvoering van de transmissie van een communicatie over een elektronisch communicatienetwerk.
Voorkeuren
De technische opslag of toegang is noodzakelijk voor het legitieme doel voorkeuren op te slaan die niet door de abonnee of gebruiker zijn aangevraagd.
Statistieken
De technische opslag of toegang die uitsluitend voor statistische doeleinden wordt gebruikt. De technische opslag of toegang die uitsluitend wordt gebruikt voor anonieme statistische doeleinden. Zonder dagvaarding, vrijwillige naleving door uw Internet Service Provider, of aanvullende gegevens van een derde partij, kan informatie die alleen voor dit doel wordt opgeslagen of opgehaald gewoonlijk niet worden gebruikt om je te identificeren.
Marketing
De technische opslag of toegang is nodig om gebruikersprofielen op te stellen voor het verzenden van reclame, of om de gebruiker op een website of over verschillende websites te volgen voor soortgelijke marketingdoeleinden.
Manage options Manage services Manage vendors Read more about these purposes
Voorkeuren
{title} {title} {title}