Commit e88d3a25 authored by Gerhard Gossen

Update version references

parent f3c130ad
@@ -3,6 +3,7 @@
 :icons: font
 :toc: preamble
 :source-highlighter: pygments
+:version: 0.1.0
 Gerhard Gossen <gossen@l3s.de>
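The newly added `:version:` attribute is what the `{version}` references elsewhere in this commit resolve to; combined with `subs="attributes"` on a source block, Asciidoctor substitutes the attribute inside listings as well. A minimal sketch of that mechanism (the document title below is illustrative, not from the README):

```asciidoc
= Illustrative Document
:version: 0.1.0

The current release is {version}.

[source,bash,subs="attributes"]
----
cp target/archive-crawler-{version}.jar /tmp/
----
```

When rendered, both occurrences of `{version}` become `0.1.0`; without `subs="attributes"`, the listing would show the literal text `{version}`.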
@@ -11,7 +12,7 @@ This makes data analysis appear daunting, especially for non-computer scientists
 These collections constitute an increasingly important source for researchers in the social sciences, the historical sciences and journalists interested in studying past events.
 However, there are currently no access methods that help users to efficiently access information, in particular about specific events, beyond the retrieval of individual disconnected documents.
 Therefore, we provide a method to extract event-centric document collections from large-scale Web archives.
 The algorithm used to extract the documents is described in our TPDL'17 paper https://doi.org/10.1007/978-3-319-67008-9_10[Extracting Event-Centric Document Collections from Large-Scale Web Archives] (https://arxiv.org/abs/1707.09217[preprint]).
 == Setup
@@ -23,10 +24,10 @@ Download and build the code by running the following shell commands:
 ----
 git clone https://github.com/gerhardgossen/archive-recrawling
 cd archive-recrawling/code
-mvn package
+mvn package -Prun
 ----
-Copy the runnable version in `target/archive-crawler-$VERSION.jar` to a cluster machine.
+Copy the runnable version in `target/archive-crawler-{version}.jar` to a cluster machine.
 == Specifying Events
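The build-and-copy step above can be sketched as a short shell session. This is illustrative, not part of the commit: the version number is taken from the `:version:` attribute, and the host name is a placeholder; the `mvn` and `scp` commands are shown commented out because they need the repository and a cluster account.

```shell
# Build the runnable JAR (the -Prun profile produces the runnable artifact):
# mvn package -Prun

# The artifact name follows the Maven convention artifactId-version.jar:
VERSION=0.1.0                              # matches the :version: doc attribute
JAR="target/archive-crawler-$VERSION.jar"
echo "$JAR"                                # -> target/archive-crawler-0.1.0.jar

# Copy it to a cluster machine (host is a placeholder):
# scp "$JAR" user@cluster.example.org:
```

Keeping the version in one shell variable mirrors what the `{version}` attribute does for the documentation: a release bump means editing a single line.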
@@ -82,17 +83,17 @@ CAUTION: By default the pages are retrieved from the https://de.wikipedia.org/[G
 Now run the Collection Specification Creator tool using
-[source,bash]
+[source,bash,subs="attributes"]
 ----
-java -cp target/archive-crawler-$VERSION.jar de.l3s.gossen.crawler.tools topicsFile.tsv
+java -cp target/archive-crawler-{version}.jar de.l3s.icrawl.crawler.tools topicsFile.tsv
 ----
 to create the `.json` files containing the collection specifications.
 You can optionally also specify a directory where the files should be stored:
-[source,bash]
+[source,bash,subs="attributes"]
 ----
-java -cp target/archive-crawler-$VERSION.jar de.l3s.gossen.crawler.tools topicsFile.tsv outputDir
+java -cp target/archive-crawler-{version}.jar de.l3s.icrawl.crawler.tools topicsFile.tsv outputDir
 ----
 The created JSON files can be used as described below.
@@ -103,16 +104,16 @@ The extraction process needs to be started on a cluster server.
 Upload the JAR you built during the setup, as well as the JSON collection specifications, to your server and log in using SSH.
 On the server, upload the JAR to HDFS, e.g. as follows:
-[source,bash]
+[source,bash,subs="attributes"]
 ----
-hadoop fs -put -f archive-crawler-$VERSION.jar
+hadoop fs -put -f archive-crawler-{version}.jar
 ----
 Now you can run the extraction as
-[source,bash]
+[source,bash,subs="attributes"]
 ----
-yarn jar archive-crawler-$VERSION.jar hdfs:///user/$USER/archive-crawler-$VERSION.jar topic.json /tmp/archive-crawler-out
+yarn jar archive-crawler-{version}.jar hdfs:///user/$USER/archive-crawler-{version}.jar topic.json /tmp/archive-crawler-out
 ----
 This command takes the following parameters:
...