It is the best of times, it is the worst of times; it is the age of information explosion, and it is also the age of big data mining. There was a time when the Internet of Things (IoT) and cloud computing were the state of the art in computing. They are partly the cause of the information explosion and partly the solution to it.
In 2006, the International Data Corporation (IDC) initiated a program named “Digital Universe”. Based on its statistics, the project predicted that the total volume of global data would increase from 0.18 ZB to 1.8 ZB within five years (Gantz, 2008). The sources of this flood of data are countless, for example:
The New York Stock Exchange generates up to 1 TB of data daily.
There are roughly 100 billion images (an estimated 1 PB) posted on Facebook.
Ancestry.com stores up to 2.5 PB of data in total.
The Internet Archive has been growing at a rate of 20 TB per month.
The Large Hadron Collider near Geneva, Switzerland, produces 15 PB of data annually. (Microsoft Research, 2014)
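To put the projected tenfold growth in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper names `convert` and `annual_growth_rate` are hypothetical, and the sketch assumes decimal (SI) units and steady compound growth over the five years, which the IDC figures do not themselves state. Under those assumptions, growing from 0.18 ZB to 1.8 ZB implies an increase of a little over 58 percent per year.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def convert(value, src, dst):
    """Convert a data size from one SI unit to another."""
    return value * UNITS[src] / UNITS[dst]


def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by growing from start to end."""
    return (end / start) ** (1 / years) - 1


# Figures quoted in the text: 0.18 ZB growing to 1.8 ZB within five years.
rate = annual_growth_rate(0.18, 1.8, 5)
print(f"Implied annual growth: {rate:.1%}")
print(f"1.8 ZB is about {convert(1.8, 'ZB', 'PB'):,.0f} PB")
```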
It is not only major corporations, databases, and academic institutions that generate data on such a large scale; we, as individuals and consumers, also contribute to this growth. According to research from Microsoft's MyLifeBits project, the data that an individual produces has been growing at an exponential rate.
Beyond its significant uses in science, commerce, and social media, big data processing technology is of great importance to the earth and environmental sciences, particularly given dramatic population growth. Scientific tools and technologies must adapt to the exponential expansion of data on natural resources, population, geography, and the environment. By processing information in human geography and the environmental sector more effectively, we gain greater knowledge of the Earth we live on. Growing environmental issues such as climate change, the use of sustainable energy resources, and space exploration call for an application that moves research beyond raw discovery and basic data sampling and can process the whole range of data. Cloud computing and the IoT have helped with the extraction of environmental data, but more up-to-date technologies such as distributed file systems can process massive data most effectively. The book Hadoop: The Definitive Guide, by Tom White, shows how high-performance data mining tools could change current information processing systems for geographic information. With a distributed file system, for example, a century of meteorological data at hourly temporal resolution requires only 42 minutes to process, compared with the hours or days that servers running at full load needed in the past. The ability to handle unstructured information will completely transform contemporary data efficiency, statistical methodology, and geographic information systems.
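The kind of processing described here can be illustrated with a minimal map-and-reduce sketch in Python that finds the highest temperature recorded in each year. It is a single-process illustration of the pattern only, not the Hadoop API: the record layout, the sample values, and the function names are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical hourly readings as (year, temperature in degrees Celsius).
# Real weather records use a denser format; this layout is an assumption.
READINGS = [
    (1901, 3.1), (1901, 7.8), (1901, -2.2),
    (1902, 5.0), (1902, 9.4),
    (1903, 1.5), (1903, 8.9), (1903, 12.0),
]


def map_phase(records):
    """Emit one (key, value) pair per reading, keyed by year."""
    for year, temp in records:
        yield year, temp


def shuffle(pairs):
    """Group values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each year's readings to its maximum temperature."""
    return {year: max(temps) for year, temps in groups.items()}


def max_temperature_by_year(records):
    """Run the map, shuffle, and reduce steps in sequence."""
    return reduce_phase(shuffle(map_phase(records)))


result = max_temperature_by_year(READINGS)
for year in sorted(result):
    print(year, result[year])
```

In a distributed setting the map step would run in parallel on the machines that hold each block of the data, and only the much smaller grouped results would travel across the network, which is where the reduction in processing time comes from.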
Advances in computer science and technology will make it easier to capture, analyse, visualise, and model environmental information, and will help researchers, scientists, and policy-makers to be better informed in the decision-making process.
A measurement of data size: 1 ZB (zettabyte) = 10^3 EB (exabytes) = 10^6 PB (petabytes) = 10^9 TB (terabytes) = 10^21 B (bytes).
Gantz et al. (2008). The Diverse and Exploding Digital Universe. [Online] Available at: http://china.emc.com//collateral/analyst-report/expanding-digital-idc-white paper.pdf. March 1, 2008.
Microsoft Research. (2015). MyLifeBits. Available at: http://research.microsoft.com/enus/project/myLifebits//defautl.aspx.
White, T. and Cutting, D. (2012). Hadoop: The Definitive Guide, 3rd edition. O’Reilly Media, Inc., pp. 22-80.