Download and build the code by running the following shell commands:
git clone
cd archive-recrawling/code
mvn package
mvn package -Prun
Copy the runnable version in `target/archive-crawler-{version}.jar` to a cluster machine.
== Specifying Events
CAUTION: By default the pages are retrieved from the
Now run the Collection Specification Creator tool using
java -cp target/archive-crawler-{version}.jar topicsFile.tsv
to create the `.json` files containing the collection specifications.
You can optionally also specify a directory where the files should be stored:
java -cp target/archive-crawler-{version}.jar topicsFile.tsv outputDir
The created JSON files can be used as described below.
The extraction process needs to be started on a cluster server.
Upload the JAR you built during the setup, together with the JSON collection specifications, to your server and log in using SSH.
On the server, upload the JAR to HDFS, e.g., as follows:
hadoop fs -put -f archive-crawler-{version}.jar
Now you can run the extraction as follows:
yarn jar archive-crawler-{version}.jar hdfs:///user/$USER/archive-crawler-{version}.jar topic.json /tmp/archive-crawler-out
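If you created several collection specifications, the same invocation can be repeated once per file. The following is only a minimal sketch under stated assumptions: the `VERSION` default, the `run_extraction` helper, the `DRY_RUN` switch, and the per-topic output subdirectory under `/tmp/archive-crawler-out` are illustrative choices, not part of the tool. It also assumes the last argument is an output location that may differ per run.

```shell
#!/bin/sh
# Sketch: run the extraction once per collection specification.
# VERSION, DRY_RUN, and the per-topic output directories are assumptions
# made for illustration; adjust them to your setup.

VERSION="${VERSION:-1.0}"                      # hypothetical version string
JAR="archive-crawler-${VERSION}.jar"
HDFS_USER="${USER:-$(id -un)}"                 # owner of the HDFS home directory
OUT_BASE="${OUT_BASE:-/tmp/archive-crawler-out}"

# Run (or, with DRY_RUN=1, only print) the yarn command for one spec file.
run_extraction() {
  spec="$1"
  # Assumption: one output directory per specification, named after the file.
  out="${OUT_BASE}/$(basename "$spec" .json)"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "yarn jar $JAR hdfs:///user/$HDFS_USER/$JAR $spec $out"
  else
    yarn jar "$JAR" "hdfs:///user/$HDFS_USER/$JAR" "$spec" "$out"
  fi
}

# Process every specification file passed on the command line.
for spec in "$@"; do
  run_extraction "$spec"
done
```

Saved under a name of your choice, the script could be called with the specification files as arguments, for example `DRY_RUN=0 sh run-all.sh *.json`. By default it only prints the commands, so you can check them before anything is submitted. Each iteration issues the same `yarn jar` command as shown above.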
This command takes the following parameters:
