Unlocking hidden data from the web

Session Overview: This session introduces some of the more advanced methods for finding data “in the web” and demonstrates how to understand and exploit it. We also look beyond CSV and introduce JSON, the increasingly dominant format for data “in the web”.

Session number: 12

Participants: Data journalists, data scientists and those with an interest in finding new data sources

Type: Training

Length: 2-3 hours

Exercises: Yes

Web based exercises: Yes

What to bring: Slides, Web-Connected Laptop


Session Flow:

  1. Defining hidden data - The facilitator should define hidden data for the participants through a discussion of the difference between data on the web (downloadable or ‘traditional’ Open Data) and data in the web (data found in the code of websites or otherwise embedded). The discussion should look at how data in the web is often made visible through pages but is more difficult to access through traditional means.
  2. Techniques for extracting hidden data - The facilitator should lead participants through an interactive exploration of some key methods for obtaining data found in the web. Techniques should include adding format extensions to URLs, reading RSS feeds, inspecting source code, content negotiation, using APIs and web scraping.
  3. Tools for data extraction - The participants will undertake directed exploration of data extraction tools such as those listed below as well as any introduced by the facilitator. Example tools may include the ‘Hidden Data Extractor’, PDF tables and
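
Facilitators who want a concrete demonstration of the techniques in step 2 could adapt the following minimal sketch. It illustrates three ideas under stated assumptions: adding a format extension to a URL, preparing a content negotiation request with an HTTP Accept header, and converting a JSON response into CSV. The function names and the example.org addresses are hypothetical and chosen for illustration only; whether a given site responds to a ".json" extension or honours the Accept header depends entirely on that site.

```python
import csv
import io
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request


def with_extension(url: str, ext: str) -> str:
    """Append a format extension such as 'json' to the path of a URL."""
    parts = urlsplit(url)
    # Drop any trailing slash so we get ".../42.json" rather than ".../42/.json".
    path = parts.path.rstrip("/") + "." + ext
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def negotiated_request(url: str, media_type: str = "application/json") -> Request:
    """Build (but do not send) a request that asks the server for a media type."""
    return Request(url, headers={"Accept": media_type})


def json_records_to_csv(text: str) -> str:
    """Convert a JSON array of flat objects into CSV text.

    Assumes every record is a flat object (no nested lists or objects).
    """
    records = json.loads(text)
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical page address; substitute a real page during the session.
    page = "https://example.org/datasets/42"
    print(with_extension(page, "json"))
    print(negotiated_request(page).get_header("Accept"))
    print(json_records_to_csv('[{"name": "A", "value": 1}, {"name": "B", "note": "x"}]'))
```

To fetch a prepared request during the session, it can be passed to urllib.request.urlopen. The sketch deliberately stops short of sending anything so that it also runs without a network connection.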




Companion eLearning Modules:

When running this session, we recommend that participants complete the following eLearning module before attending:

Finding hidden data on the web

Completion of the module will help your learners develop a shared understanding of the material before the course and allow you to focus in greater depth on those topics of most interest to the trainees.