We are working with traffic data for the city of Madrid. Due to the complexity of the project, we have divided it into several stages.
In the first stage, our project will collect the open data available at http://datos.madrid.es/, the open data portal of the Ayuntamiento de Madrid.
The dataset is organized by month and aggregated by year. Its size is around 6 GB per year, and we need two or more years to look for seasonality patterns, so our team is going to establish workflows to build, first of all, a simple tool to visualize and study the data. With this tool we are particularly interested in events such as music festivals, parades and sports matches, and the way they affect the flow of cars. We are also looking at daily events, both on a city scale, like rush hour, and local ones, like parents dropping off and picking up children at schools. Used in an unsupervised mode, this tool can also help us find unusual data and try to relate it to events.
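As an illustration of what finding unusual data could mean in practice, here is a minimal sketch that flags outliers in a series of vehicle counts using a robust z-score based on the median and the median absolute deviation (MAD). The function name `flag_unusual`, the threshold of 3.5 and the sample values are assumptions made for this example; they are not part of the project's actual tool.

```python
from statistics import median


def flag_unusual(counts, threshold=3.5):
    """Return the indices of counts whose robust z-score exceeds threshold.

    Hypothetical helper: the name and the default threshold are
    illustrative choices, not taken from the project.
    """
    med = median(counts)
    # Median absolute deviation: a spread estimate that is not
    # dominated by the outliers we are trying to find.
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # All values are (nearly) identical, nothing can be scored.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a
    # standard z-score for normally distributed data.
    return [
        i for i, c in enumerate(counts)
        if abs(0.6745 * (c - med) / mad) > threshold
    ]


# Made-up hourly counts for one sensor; index 5 is the odd one out.
hourly = [120, 118, 125, 122, 119, 480, 121, 117]
unusual = flag_unusual(hourly)
print(unusual)
```

The flagged indices could then be compared against a calendar of known events (festivals, matches, parades) to see whether an event explains the deviation.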
In this phase, we are considering the effect of seasonality patterns and removing them from the data. Since we are working with big data and combining large datasets for the analysis, we will load all the pre-processed information and use dashDB to show the data.
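One simple way to remove a seasonality pattern is to subtract the average value observed for each (weekday, hour) slot, leaving only the deviation from the typical week. The sketch below assumes readings arrive as (timestamp, vehicle count) pairs; the function name `remove_weekly_profile` and the sample data are hypothetical, and the project may well use a different decomposition.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def remove_weekly_profile(readings):
    """Subtract the mean count of each (weekday, hour) slot.

    readings: list of (datetime, vehicle_count) pairs.
    Returns a list of (datetime, residual) pairs in the same order.
    Hypothetical helper, written only to illustrate the idea.
    """
    # Group counts by their position in the week.
    buckets = defaultdict(list)
    for ts, count in readings:
        buckets[(ts.weekday(), ts.hour)].append(count)

    # The weekly profile is the mean count of each slot.
    profile = {slot: sum(v) / len(v) for slot, v in buckets.items()}

    # What remains after subtracting the profile is the part of the
    # signal not explained by the weekly pattern.
    return [
        (ts, count - profile[(ts.weekday(), ts.hour)])
        for ts, count in readings
    ]


# Made-up sample: the same slot on two consecutive weeks,
# plus a single reading one hour later.
start = datetime(2016, 1, 4, 8)
readings = [
    (start, 300),
    (start + timedelta(weeks=1), 340),
    (start + timedelta(hours=1), 200),
]
residuals = remove_weekly_profile(readings)
for ts, residual in residuals:
    print(ts.isoformat(), residual)
```

With two or more years of data, the same approach could be extended with a month or week-of-year component to capture yearly seasonality.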
The second stage of the project is to include other datasets and look for relations between them that could be related to traffic. The idea is to merge data from different sources. This way we improve the original data, and we also hope to obtain insights into the influence of other forms of transport in the city (pedestrians, bicycles, bus, metro). At this stage we are planning to use IBM Analytics for Hadoop and IBM Insights for Twitter, combining Hive and a MapReduce approach to achieve better and faster results. All the data will be mixed and analyzed, including sentiment analysis of Twitter data.
Some of our target datasets are:
EMT, weather, bicycles, traffic cameras, security cameras, Twitter, traffic lights, taxi, metro, Waze, school locations.
The final result of phase two will be a public API, available to the residents of Madrid, with which they can generate their own insights.
The final stage will deploy a real-time analysis of traffic based on many datasets. This will be difficult because of the need for a relatively large amount of data in real time, the use of different kinds of data, and the differing values produced by many types of sensors.
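To give an idea of what merging data from different sources involves, the sketch below joins traffic readings with weather observations on a shared hourly key. It is plain Python rather than Hive or MapReduce, and the record layouts, the function names `hour_key` and `join_by_hour`, and the sample values are all assumptions made for illustration.

```python
from datetime import datetime


def hour_key(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def join_by_hour(traffic, weather):
    """Attach the weather observation of the same hour to each reading.

    traffic: list of (datetime, vehicle_count) pairs.
    weather: list of (datetime, observation) pairs.
    Readings without a matching observation get weather=None, so no
    traffic data is dropped (the equivalent of a left join).
    """
    weather_by_hour = {hour_key(ts): obs for ts, obs in weather}
    return [
        {
            "time": ts,
            "vehicles": count,
            "weather": weather_by_hour.get(hour_key(ts)),
        }
        for ts, count in traffic
    ]


# Made-up sample records.
traffic = [
    (datetime(2016, 5, 1, 8, 15), 310),
    (datetime(2016, 5, 1, 9, 45), 280),
]
weather = [(datetime(2016, 5, 1, 8, 0), "rain")]

merged = join_by_hour(traffic, weather)
for row in merged:
    print(row["time"].isoformat(), row["vehicles"], row["weather"])
```

Sources such as bicycles, metro or taxi would also need a spatial key (for example a district or the nearest sensor), since time alone does not say where in the city a reading applies.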