Commit e88d3a25 authored by Gerhard Gossen

Update version references

parent f3c130ad
:icons: font
:toc: preamble
:source-highlighter: pygments
:version: 0.1.0
Gerhard Gossen <gossen@l3s.de>
This makes data analysis appear daunting, especially for non-computer scientists.
These collections constitute an increasingly important source for researchers in the social sciences and the historical sciences, as well as for journalists interested in studying past events.
However, there are currently no access methods that help users efficiently find information, in particular about specific events, beyond the retrieval of individual disconnected documents.
Therefore, we provide a method to extract event-centric document collections from large-scale Web archives.
The algorithm used to extract the documents is described in our TPDL'17 paper https://doi.org/10.1007/978-3-319-67008-9_10[Extracting Event-Centric Document Collections from Large-Scale Web Archives] (https://arxiv.org/abs/1707.09217[preprint]).
== Setup
Download and build the code by running the following shell commands:

[source,bash]
----
git clone https://github.com/gerhardgossen/archive-recrawling
cd archive-recrawling/code
mvn package -Prun
----
Copy the runnable version in `target/archive-crawler-{version}.jar` to a cluster machine.
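For example, you could copy the JAR with `scp` (a minimal sketch; the host name `cluster.example.org` is a placeholder for your actual cluster machine):

[source,bash,subs="attributes"]
----
# Copy the built JAR to the cluster machine (host name is hypothetical)
scp target/archive-crawler-{version}.jar $USER@cluster.example.org:
----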
== Specifying Events
CAUTION: By default the pages are retrieved from the https://de.wikipedia.org/[German Wikipedia].
Now run the Collection Specification Creator tool using
[source,bash,subs="attributes"]
----
java -cp target/archive-crawler-{version}.jar de.l3s.icrawl.crawler.tools topicsFile.tsv
----
to create the `.json` files containing the collection specifications.
Optionally, you can also specify a directory where the files should be stored:
[source,bash,subs="attributes"]
----
java -cp target/archive-crawler-{version}.jar de.l3s.icrawl.crawler.tools topicsFile.tsv outputDir
----
The created JSON files can be used as described below.
The extraction process needs to be started on a cluster server.
Upload the JAR you built during setup, as well as the JSON collection specifications, to your server and log in using SSH.
On the server, upload the JAR to HDFS, e.g. as follows:
[source,bash,subs="attributes"]
----
hadoop fs -put -f archive-crawler-{version}.jar
----
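To verify that the upload succeeded, you can list the file in your HDFS home directory (an optional sanity check):

[source,bash,subs="attributes"]
----
# List the uploaded JAR to confirm it arrived in the HDFS home directory
hadoop fs -ls archive-crawler-{version}.jar
----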
Now you can run the extraction as
[source,bash,subs="attributes"]
----
yarn jar archive-crawler-{version}.jar hdfs:///user/$USER/archive-crawler-{version}.jar topic.json /tmp/archive-crawler-out
----
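For example, with the `version` attribute set to `0.1.0` as in the document header, the resolved command would read:

[source,bash]
----
# Same invocation with the version attribute resolved to 0.1.0
yarn jar archive-crawler-0.1.0.jar hdfs:///user/$USER/archive-crawler-0.1.0.jar topic.json /tmp/archive-crawler-out
----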
The `yarn jar` command takes the following parameters: