Petter Reinholdtsen 70468bc82d Correct scraper. 5 years ago
bin Avoid hardcoding directory names. 6 years ago
cgi-bin Web frontend prototype. 10 years ago
data Make git ignore .pyc-files, and add a keepalive-file to the data folder 7 years ago
scrapersources Correct scraper. 5 years ago
testlib/lazycache Add a copy of the old lazycache library. 6 years ago
.gitignore Ignore IDEA and data folders 7 years ago
README Update the setup instructions and add a title. 6 years ago
env-setup Added python-requests, python-lxml, and python-cssselect to the list of needed packages. 6 years ago
fetch-scraper-sources Update to use new URLs. 8 years ago
fetch-scraper-sqlite Add script to fetch sqlite files directly. 8 years ago
move-postjournal Add new source. 7 years ago
move-postjournal-elasticsearch Add test script for elasticsearch. 7 years ago
postliste-keysummary Add two scripts. 10 years ago
run-scraper Fix problem with non-ascii output to the log. 5 years ago

README

Scrapers for Norwegian post journal sources
===========================================

The classic API code is available from:

https://bitbucket.org/ScraperWiki/scraperwiki-classic/src/c7f076950476?at=default
https://github.com/rossjones/ScraperWikiX/blob/master/services/scriptmgr/scripts/exec.py


Standalone library: https://github.com/scraperwiki/scraperwiki-python

== Running / testing scrapers ==

To get the scrapers running, you need to set up the data directory and
a patched copy of the scraperwiki-python project. The env-setup script
does this. Run it from the top of the checked out scraper directory to
set up your own copy:

./env-setup

To run a scraper, use the run-scraper command with the scraper name as
its argument. For example:

./run-scraper postliste-oep
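
To run several scrapers one after another, a small wrapper around the
command above may be convenient. The following is only a sketch: the
function name run_scrapers and its runner parameter are hypothetical
and not part of this repository, and it assumes that run-scraper
reports failure through its exit status, which has not been verified.

```python
import subprocess


def run_scrapers(names, runner=subprocess.call):
    """Run ./run-scraper once per scraper name.

    Returns a dict mapping each name to the value returned by runner
    (the exit status when runner is subprocess.call).
    """
    results = {}
    for name in names:
        # The relative path only resolves when the current directory is
        # the top of the checked out scraper directory.
        results[name] = runner(["./run-scraper", name])
    return results
```

Calling run_scrapers(["postliste-oep"]) from the top of the checkout
would start the same command as shown above. The runner parameter
exists only so the wrapper can be tried without starting a real
scraper.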

== Common field names ==

List of field names used in most scrapers. All dates use ISO format:
YYYY-MM-DD, "YYYY-MM-DD HH:MM" or "YYYY-MM-DD HH:MM+TZ".

* agency, name of public administration
* recorddate
* docdate
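
As an illustration of the three accepted date formats, the sketch
below tries each one in turn. The helper name parse_journal_date is
hypothetical and not part of the scrapers, and it assumes that "+TZ"
means a numeric UTC offset such as +0100 or +01:00, which the text
above does not spell out.

```python
from datetime import datetime

# Assumption: "+TZ" is a numeric UTC offset, e.g. +0100 or +01:00.
DATE_FORMATS = (
    "%Y-%m-%d",          # YYYY-MM-DD
    "%Y-%m-%d %H:%M",    # YYYY-MM-DD HH:MM
    "%Y-%m-%d %H:%M%z",  # YYYY-MM-DD HH:MM+TZ
)


def parse_journal_date(value):
    """Return a datetime for value, or raise ValueError if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format did not match, try the next one.
            continue
    raise ValueError("not an accepted date format: %r" % (value,))
```

A value such as "2012-03-04 13:45+0100" yields a timezone-aware
datetime, while the two shorter forms yield naive ones. Anything else,
for example "04.03.2012", raises ValueError.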