Commit f3c130ad authored by Gerhard Gossen's avatar Gerhard Gossen

Initial version of tutorial

    asciidoctor (
    coderay (1.1.2)
    em-websocket (0.5.1)
      eventmachine (>= 0.12.9)
      http_parser.rb (~> 0.6.0)
    eventmachine (1.2.5)
    ffi (1.9.18)
    formatador (0.2.5)
    guard (2.14.1)
      formatador (>= 0.2.4)
      listen (>= 2.7, < 4.0)
      lumberjack (~> 1.0)
      nenv (~> 0.1)
      notiffany (~> 0.0)
      pry (>= 0.9.12)
      shellany (~> 0.0)
      thor (>= 0.18.1)
    guard-asciidoctor (0.1.1)
      guard (~> 2.0)
      guard-compat (~> 1.1)
    guard-compat (1.2.1)
    guard-livereload (2.5.2)
      em-websocket (~> 0.5)
      guard (~> 2.8)
      guard-compat (~> 1.0)
      multi_json (~> 1.8)
    guard-shell (0.7.1)
      guard (>= 2.0.0)
      guard-compat (~> 1.0)
    http_parser.rb (0.6.0)
    listen (3.1.5)
      rb-fsevent (~> 0.9, >= 0.9.4)
      rb-inotify (~> 0.9, >= 0.9.7)
      ruby_dep (~> 1.2)
    lumberjack (1.0.12)
    method_source (0.9.0)
    multi_json (1.12.2)
    nenv (0.3.0)
    notiffany (0.1.1)
      nenv (~> 0.1)
      shellany (~> 0.0)
    pry (0.11.3)
      coderay (~> 1.1.0)
      method_source (~> 0.9.0)
    pygments.rb (1.2.0)
      multi_json (>= 1.0.0)
    rb-fsevent (0.10.2)
    rb-inotify (0.9.10)
      ffi (>= 0.5.0, < 2)
    ruby_dep (1.5.0)
    shellany (0.0.1)
    thor (0.20.0)
    yajl-ruby (1.3.1)
      rb-inotify (~> 0.9)
Bundler.require :default
require 'erb'

guard 'shell' do
  watch(%r{^[a-zA-Z].+\.adoc$}) do |m|
    `asciidoctor #{m[0]}`
  end
end

guard 'livereload' do
  # Body reconstructed: the original watch pattern was lost in this copy.
  watch(%r{\.html$})
end
.PHONY: publish
publish: archive-crawling.html
	scp archive-crawling.html fs3:public_html/sobigdata/tutorial/index.html

%.html: %.adoc
	asciidoctor $<
// -*- mode: adoc; -*-
= Extracting Event-Based Collections from Web Archives
:icons: font
:toc: preamble
:source-highlighter: pygments
Gerhard Gossen <>
Web archives are typically very broad in scope and extremely large in scale.
This makes data analysis appear daunting, especially for non-computer scientists.
These collections constitute an increasingly important source for researchers in the social sciences, the historical sciences and journalists interested in studying past events.
However, there are currently no access methods that help users to efficiently access information, in particular about specific events, beyond the retrieval of individual disconnected documents.
Therefore we provide a method to extract event-centric document collections from large scale Web archives.
The algorithm used to extract the documents is described in our TPDL'17 paper "Extracting Event-Centric Document Collections from Large-Scale Web Archives" (a preprint is available).
== Setup
The code for the collection extractor is available on GitHub.
Download and build the code by running the following shell commands:
git clone
cd archive-recrawling/code
mvn package
Copy the runnable version in `target/archive-crawler-$VERSION.jar` to a cluster machine.
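For example, you can copy the JAR with `scp`; the hostname and user below are placeholders for your own cluster machine:

```
# Copy the built JAR to the cluster (hostname and user are placeholders).
scp target/archive-crawler-$VERSION.jar user@cluster.example.org:
```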
== Specifying Events
Each extracted collection is based on a given event.
The event is described as a JSON document.
The Git repository contains some example descriptions in the directory[collection_specifications].
The most important properties of the specification object are:
- Short name of the collection.
- Human-readable description of the collection topic.
- Documents that describe the collection topic, including documents used to create this specification (not used during the extraction).
- Fallback language if the document language cannot be determined or if there is no language model for the detected language.
- Array of relevant documents that are used as starting points.
- Time span of the event, with additional information about the expected relevance of documents before and after that period.
- One map for each relevant language containing the relative relevance of terms or term n-grams.
- One array for each relevant language containing especially typical keywords or entity names.
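A minimal specification covering these properties might look as follows. The field names and values here are illustrative guesses based on the descriptions above, not the actual schema; consult the examples in the `collection_specifications` directory for the exact property names.

```json
{
  "name": "olympics-2002",
  "description": "2002 Winter Olympics in Salt Lake City",
  "referenceDocuments": ["http://example.org/background-article"],
  "fallbackLanguage": "de",
  "seedDocuments": ["http://example.org/olympics-report"],
  "timespan": {
    "from": "2002-02-08", "until": "2002-02-24",
    "before": "P30D", "after": "P30D"
  },
  "termWeights": { "de": { "olympische winterspiele": 0.9 } },
  "keywords": { "de": ["Salt Lake City", "Olympia"] }
}
```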
=== Creating Collection Specifications From Wikipedia
You can create collection specifications from Wikipedia using a provided script.
First, you need to create a `.tsv` file (tab separated text file) with the metadata and the names of the relevant Wikipedia pages.
This file should have the following header (tab-separated) and one line for each collection topic with the corresponding information:
.Wikipedia collection specification file header
Name From Until Before After Description Wikipedia
The columns have the following meaning:
name:: Short name of the collection.
description:: Human-readable description of the collection topic.
from, until:: Time period of the event (used as the reference time); the format is `YYYY-MM-DD` (ISO 8601 date).
before, after:: Approximation of the time periods before and after the event that contain relevant information (ISO 8601 period, e.g. `P30D`).
wikipedia:: Names of Wikipedia pages, comma separated. You can also mix in names of WikiNews categories prefixed with `news:`, e.g. `news:Olympische_Winterspiele_2002`.
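A file with a single collection topic could look like this (columns separated by tabs; the values are invented for illustration):

```
Name	From	Until	Before	After	Description	Wikipedia
Olympia2002	2002-02-08	2002-02-24	P30D	P30D	2002 Winter Olympics	Olympische_Winterspiele_2002,news:Olympische_Winterspiele_2002
```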
CAUTION: By default the pages are retrieved from the German Wikipedia; you will have to modify the code to use it with a different language version.
Now run the Collection Specification Creator tool using
java -cp target/archive-crawler-$VERSION.jar topicsFile.tsv
to create the `.json` files containing the collection specifications.
You can optionally also specify a directory where the files should be stored:
java -cp target/archive-crawler-$VERSION.jar topicsFile.tsv outputDir
The created JSON files can be used as described below.
== Extracting Collections
The extraction process needs to be started on a cluster server.
Upload the JAR you built during the setup as well as the JSON collection specifications to your server and log in using SSH.
On the server, upload the JAR to HDFS, e.g. as follows:
hadoop fs -put -f archive-crawler-$VERSION.jar
Now you can run the extraction as
yarn jar archive-crawler-$VERSION.jar hdfs:///user/$USER/archive-crawler-$VERSION.jar topic.json /tmp/archive-crawler-out
This command takes the following parameters:
JAR path in HDFS::
The location where you uploaded your JAR (required).
Collection specification filename::
Path to the collection specification relative to your working directory (required).
HDFS output path::
Name of the directory used to store the results (required).
This path must not exist before you run the command; existing directories will never be overwritten.
Number of URLs::
Maximum number of URLs to analyze (optional).
If this parameter is not given, the process will stop after analyzing 10,000 documents.
Weighting Method::
The weighting method used to estimate document relevance (optional).
Default is `CONTENT_AND_TIME`.
Relevance Threshold::
Minimum relevance that a document should have to be extracted (optional).
Snapshots to Analyze::
Number of versions to analyze for each URL (optional).
Default is 1, i.e. only look at the earliest version in the event timespan or the version closest to the event timespan.
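Putting the parameters together in the order listed above, a full invocation could look like this. The values for the optional parameters (50000, 0.5, 3) are examples only:

```
# Arguments, in order: JAR path in HDFS, collection specification,
# HDFS output path, max. number of URLs, weighting method,
# relevance threshold, snapshots per URL.
yarn jar archive-crawler-$VERSION.jar \
  hdfs:///user/$USER/archive-crawler-$VERSION.jar \
  topic.json /tmp/archive-crawler-out \
  50000 CONTENT_AND_TIME 0.5 3
```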
Once you start the extraction process, you will get the URL of the monitoring Web interface that you can use to check the current status.
After the process has finished, it will create a ZIP file in the HDFS output path specified above.
The name is based on the collection name in the specification with the suffix `.zip`.
== Analysing Collections
=== Zip file output format
The output for an extracted collection is a standard ZIP file containing:
- the extracted source files (named `0.html`, `1.html`, ...).
NOTE: All files are re-encoded to UTF-8 to ease further processing, even if the HTML meta tags disagree.
- an overview file (`urls.csv`).
- a file listing URLs that would have been included, but are missing from the archive (`missing.csv`)
The latter two files have the following columns:
.Columns of `urls.csv`
[%header, cols="1,5"]
|===
| column | description

| url | original URL of the document
| path | crawl path for reaching the document (`S` = seed, `L` = link)
| relevance | estimated relevance of the document ([0.0, 1.0])
| crawlTime | time the document was retrieved from the web (ISO 8601)
| file | name of the file in the `.zip`
|===
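The overview file can be post-processed with standard command-line tools. For example, to select the files of all documents above a relevance threshold (the CSV contents below are invented sample data, not real extraction output):

```shell
# Create a small sample urls.csv in the documented column layout.
cat > urls.csv <<'EOF'
url,path,relevance,crawlTime,file
http://example.org/a,S,0.91,2002-02-10T12:00:00Z,0.html
http://example.org/b,SL,0.42,2002-02-11T08:30:00Z,1.html
http://example.org/c,SLL,0.77,2002-02-12T09:15:00Z,2.html
EOF

# Skip the header row and print the file name of every document
# whose relevance (column 3) is at least 0.5.
awk -F, 'NR > 1 && $3 >= 0.5 { print $5 }' urls.csv
```

This prints the ZIP member names (`0.html` and `2.html` for the sample data), which you could then feed to `unzip` to extract only the most relevant documents.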
.Columns of `missing.csv`
[%header, cols="1,5"]
|===
| column | description

| url | original URL of the document
| path | crawl path for reaching the document (`S` = seed, `L` = link)
| priority | estimated relevance of the document ([0.0, 1.0])
|===