Here is a little tutorial you might find useful if you scrape websites and don’t want to reveal your IP address while doing so.
First, you should know about Tor. If you ended up here, chances are that you have heard of Tor before. It’s a network of volunteer-run relays that encrypts and redirects network traffic to enable anonymous web use. All you need to do is install the latest Tor Browser Bundle, which you can find on the Tor website.
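As a rough sketch of what this looks like from a scraper's point of view: the snippet below assumes that Tor Browser is running locally and exposes its SOCKS proxy on 127.0.0.1:9150 (a standalone tor daemon usually uses port 9050), and that the third-party `requests` library is installed with SOCKS support. The function names `tor_proxies` and `fetch_via_tor` are made up for this illustration.

```python
# Assumption: Tor Browser is running locally and exposes its SOCKS proxy
# on 127.0.0.1:9150 (a standalone tor daemon usually listens on 9050).


def tor_proxies(host="127.0.0.1", port=9150):
    """Build a proxies mapping for the `requests` library.

    The socks5h scheme asks the proxy to resolve host names, so DNS
    lookups go through Tor as well instead of leaking locally.
    """
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_tor(url, timeout=30):
    """Fetch a URL through the local Tor SOCKS proxy and return its text."""
    # Imported here so that tor_proxies() works without third-party packages.
    import requests  # needs SOCKS support: pip install "requests[socks]"

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_via_tor` on a page that echoes your IP address is a simple way to check that the request really leaves through Tor and not through your own connection.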
On May 28 I had the occasion to talk at Köln International School of Design (KISD) about how the availability of data transforms and enables services, especially in the mobility world. The video shows the presentation slides; the audience and I are audio-only. Thanks to KISD for having me!
Enabling yourself to work with data sources from the web still has a lot to do with scraping. And, I assume, it will for a long time to come. Even if the data you need is available via a webservice API, what you do in order to harvest larger amounts of data is something like scraping. Again and again you will have to answer the same questions.
- What language should I use to write my scraper?
- Which HTTP client should I use?
- How can I parse HTML input, maybe XML, JSON and all the rest?
- How can I scrape pages that require form submission, session handling etc.?
- How can I store the data?
- How can I avoid scraping the same stuff over and over again?
- How can I make the scraper fault tolerant, so that it won’t break with a single 404 response?
If you’re simply gathering content from one HTML table, you probably won’t mind. But as your projects grow larger, these questions will become more important.
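To make two of these questions a bit more concrete (not scraping the same page twice, and surviving a failed request), here is a minimal sketch using only the Python standard library. The names `http_get` and `scrape` are hypothetical, and a plain dict stands in for whatever storage a real project would use.

```python
import urllib.error
import urllib.request


def http_get(url, timeout=30):
    """Default fetcher: return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape(urls, cache, fetch=http_get):
    """Fetch each URL at most once and keep going when a request fails.

    `cache` is any dict-like store mapping URL to body. Returns a dict
    mapping each failed URL to its error message.
    """
    failures = {}
    for url in urls:
        if url in cache:
            continue  # already scraped earlier, skip the request
        try:
            cache[url] = fetch(url)
        except (urllib.error.URLError, TimeoutError) as error:
            # HTTPError (for example a 404) is a subclass of URLError,
            # so a single bad page does not stop the whole run.
            failures[url] = str(error)
    return failures
```

Passing the fetcher in as a parameter keeps the loop easy to test without touching the network, and the returned failures can be retried later.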
A German social scientist has observed a correlation between the percentage of votes for the Green party and low voter turnout. This has motivated him to write a book about how the Green party puts our democracy at risk. Which in turn has motivated me to look into his arguments.
I have a geeky personal long-term project going on: I run to create data. Of course, I run to become a fitter person. Actually, I also run to become a more persistent mountain biker, since it takes less time to train endurance by running than by mountain biking. As a consequence, I have been running more or less consistently for about a year now. (I ran before, but that was about 2003, and I didn’t collect data back then.)
By now, I have collected data for about 80 runs. Here is when I ran, and how far.
About my data collection
I wear a Zephyr HxM heart rate sensor on a chest belt, which also collects cadence data. This data is recorded by the SportsTracker app on my Android phone. The app also records, of course, position over time via the phone’s GPS sensor. Over the course of my data gathering, two different Android phones have done the job: an HTC Desire and a Samsung Galaxy SII.
A note to the reader (May 2014): This article is pretty dated (Nov 2011). I haven’t worked with CouchDB since, so I can’t tell whether things have improved, but I guess it’s fair to assume they have. Feel free to comment about your experiences, so people stopping by here get a more balanced view.
With many people talking about CouchDB, I got curious and took a closer look. Map/Reduce in particular made me want to find out if CouchDB would be a good solution for creating aggregated statistics from a large number of records. A presentation on SlideShare caught my eye, especially the structured keys and group levels (see slide 43). Since I am now maintaining three growing time series databases (radiation data from Japan and Germany plus air quality data), I wondered if CouchDB would be an option to store and process that data.
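To illustrate the idea behind structured keys and group levels, here is a plain-Python emulation. It does not talk to CouchDB and is not CouchDB’s API; the document layout (a `time` list and a `value` field) and the function names are assumptions made for this sketch. The map step emits a key such as `[year, month, day, hour]`, and the reduce step aggregates over the first `group_level` elements of that key.

```python
from collections import defaultdict


def map_reading(doc):
    """Map step: emit a structured key (year, month, day, hour) and a value."""
    year, month, day, hour = doc["time"]
    yield (year, month, day, hour), doc["value"]


def reduce_by_group_level(docs, group_level):
    """Aggregate values per key prefix, similar in spirit to group_level.

    Returns a dict mapping each key prefix to (count, mean).
    """
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_reading(doc):
            # group_level=3 keeps (year, month, day): one result per day.
            groups[key[:group_level]].append(value)
    return {
        prefix: (len(values), sum(values) / len(values))
        for prefix, values in groups.items()
    }
```

With the same emitted keys, `group_level=1` would yield yearly statistics, `group_level=2` monthly ones and `group_level=3` daily ones, which is what makes structured keys attractive for time series.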
Short disclaimer: I am by no means a CouchDB, NoSQL or DBMS expert. I usually make MySQL do the things I want, without ever having exceeded the limits of what a single machine can handle. All I did was some testing; I didn’t go into production. The reason I post my experience anyway is to foster discussion, learn from it and help others save time.
This is a new data visualization I created recently, depicting long-term measurements of emissions in the atmosphere in my home area Nordrhein-Westfalen (NRW), Germany. It uses data scraped from LANUV, the environment agency of NRW. (Since the domain is limited to a region in Germany, the application is in German, too.)
What I found particularly interesting when analyzing the original measurements was the fact that some measurement stations are equipped with wind direction sensors. For these stations, every measurement of emissions is linked to a wind direction. For some stations, this reveals differences in emission levels, depending on the wind direction.
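A simple way to look for such differences is to bin measurements by wind direction and compare the averages. The sketch below is only an illustration, not the code behind the application: the input format (pairs of wind direction in degrees and a measured value) and the choice of eight compass sectors are assumptions.

```python
from collections import defaultdict

SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def sector(degrees):
    """Map a wind direction in degrees (0 = north) to one of 8 sectors."""
    # Shift by half a sector (22.5 degrees) so that N covers 337.5 to 22.5.
    return SECTORS[int(((degrees % 360) + 22.5) // 45) % 8]


def mean_by_sector(measurements):
    """Average value per wind sector.

    `measurements` is an iterable of (wind_direction_degrees, value) pairs.
    Sectors without any measurement are left out of the result.
    """
    groups = defaultdict(list)
    for degrees, value in measurements:
        groups[sector(degrees)].append(value)
    return {name: sum(values) / len(values) for name, values in groups.items()}
```

A station whose average is clearly higher for one sector than for the others hints at a source lying in that direction.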
There are definitely many interesting ways to analyse the measurements, aside from wind direction. However, I decided to start with a very limited set of displays, make the application public and then gather feedback on what users find interesting and what puzzles them.
So please let the feedback flow! Comment here (or on the German post at G+) or contact me (see right-hand side).
The German Bundesamt für Strahlenschutz (BfS) is, among other duties, in charge of measuring gamma radiation in the atmosphere throughout Germany. A network of about 1750 Geiger counters collects data which is displayed publicly on a very basic user interface. Since the tsunami in Japan and the subsequent, ongoing nuclear disaster in Fukushima, interest in radiation has risen everywhere. According to a BfS press relations officer, the BfS has been confronted with numerous requests to open up their raw data to the public. Fortunately, the BfS has reacted and given access to raw data downloads on a per-user basis.
I have started archiving that data in order to be able to create time-series visualizations. The above video shows a first non-interactive attempt with the data I have so far. (Unfortunately, the BfS only publishes 2-hour-interval readings for a 24-hour timespan and 24-hour-interval data for 7 days. So far, there is no access to historic data reaching further into the past.)
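Because each download only covers a short rolling window, building an archive means merging overlapping snapshots without storing a reading twice. The following sketch shows one way to do that; it is not taken from the actual archiving code, and the station IDs, timestamp strings and the name `merge_snapshot` are hypothetical.

```python
def merge_snapshot(archive, snapshot):
    """Add readings from one download to the archive, skipping known ones.

    Both arguments map (station_id, timestamp) to a reading. Existing
    archive entries are never overwritten. Returns the number of readings
    that were new.
    """
    added = 0
    for key, value in snapshot.items():
        if key not in archive:
            archive[key] = value
            added += 1
    return added
```

Run on every download, this lets the archive grow beyond the published window, as long as downloads happen often enough that consecutive windows overlap.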
So now we can see, for a limited time frame, how radiation behaved dynamically. Not everybody knows that radiation is a fact of daily life. Even fewer might know that it differs quite a bit between locations and also changes over time. To me it’s particularly interesting to see how values rise and fall like waves moving over the country. The probable explanation for this is the influence of rain, which causes readings to rise. The BfS has more information about this in German.
Of course, radiation is a difficult thing to substantiate. Even if numeric values are displayed as colours or as marks on a scale, the meaning remains abstract. And I have no plans to change this.
However, I can imagine a whole lot of other things to try out. For example, why not add a layer of rain information, so one can see whether a local rise in radiation actually correlates with rain? Or, of course, try different visual encodings for values, like circle radius, or go into the third dimension. And, naturally, there should be an interactive way to “scrub” over the time scale, so users can choose the time frame and speed for themselves. (With a good video player, this works, but the Vimeo player doesn’t really allow for that.) And last but not least, there could be numerous ways to interact with single dots (sensors) or a group of them, to get details, curves etc. or to compare radiation at two spots.
What do you think? What would you like to know and see?
Update: The Python source code is available here.
This is a big moment for me. I am now a business. Starting now, I am self-employed. Ready to take on your Interaction Design / UXD assignments and willing to excel.
For you, this means that I will no longer see you as someone only interested in my content. I’ll rather regard you as a potential client or project partner. This means that I might think more about your needs and what could make you happy. And it means that I should have much more time for you, because this blog has now become a much more important place for me. Isn’t that nice?
Seriously, more than ever I am willing to get involved with you. Please feel invited to comment on stuff here, ask questions, give feedback and talk. Openly or in private. My contact details are available on every page.
It’s time to give a brief update on my efforts to make radiation data provided by the Japanese administration more accessible.
The good news: I can now offer you a much more complete raw data download with higher data quality. Data now goes back to March 1. This means that the complete development of the nuclear crisis since the earthquake and tsunami is covered.
Please see the Japan Radiation Open Data page for details on the download.
Some bad facts remain. Fukushima and Miyagi prefectures still don’t report values to the SPEEDI system (which is the source of the data). Their values have been missing since 2011-03-11 05:40 UTC. And Ishikawa prefecture hasn’t contributed values since yesterday.
In the meantime, people have started using the data available. The example below is a screenshot from a Japan Radiation Map created by Geir Endahl.
Eron Villarreal has contributed a data dashboard using Tableau Software.
Please comment to let others know what you are doing with the radiation data.
Update: I have done a draft animation to visualize the “burst” effects in Ibaraki around March 14.