1. OpenRefine
OpenRefine is a popular tool for cleaning and transforming messy data before analysis. Even when the same item appears under many different names or formats, its clustering algorithms can group the variant entries so they can be merged. Once clustering is complete, analysis can begin.
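To make the clustering idea concrete, here is a minimal Python sketch of key-based grouping: each value is reduced to a normalized key, and values that share a key are treated as variants of the same entry. This is a simplified illustration in the spirit of a "fingerprint" key, not OpenRefine's actual implementation; the function names `fingerprint` and `cluster` and the sample data are assumptions made for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (illustrative, simplified)."""
    # Fold accented characters to their closest ASCII form.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Trim, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", ascii_text.strip().lower())
    # Deduplicate and sort tokens so that word order does not matter.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a key; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data: several spellings of the same two names.
names = ["Acme Inc.", "acme inc", "Inc. ACME", "Globex", "Café Roma", "cafe roma"]
clusters = cluster(names)
print(clusters)
```

Each returned group is a candidate for merging into one canonical value; "Globex" has no variants, so it is left out.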
2. Hadoop
Big data and Hadoop are closely linked. This software library and framework uses simple programming models to distribute the processing of large data sets across clusters of computers. It scales from a single server to thousands of machines, each contributing local computation and storage. The Apache Software Foundation, which maintains Hadoop, continues to improve it.
3. Storm
Storm, also from Apache, is a distributed real-time computation system that makes it easier to reliably process unbounded streams of data. It can also be used for a range of other big data tasks, including distributed RPC, continuous computation, online machine learning, and real-time analytics. Another advantage of Storm is that it integrates with many other technologies, which further reduces the complexity of big data processing.
4. Plotly
Plotly is a data visualization tool with libraries for languages such as JavaScript, MATLAB, Python, and R. It can even help users who lack the skills or the time to write code to build interactive visualizations. It is popular with a new generation of data scientists because, as a business-oriented platform, it helps them understand and analyze large data sets quickly.
5. RapidMiner
Another important tool for big data processing, RapidMiner is an open source data science platform built around visual programming. It lets users create, modify, and analyze models, and quickly integrate the results into business processes. RapidMiner has attracted a lot of attention and is regarded as a reliable tool by many well-known data scientists.
6. Cassandra
Apache Cassandra is another tool worth attention because it manages large volumes of data effectively and efficiently. It is a scalable NoSQL database that can replicate data across multiple data centers, and it has been used by well-known companies such as Netflix and eBay.
7. Hadoop MapReduce
Hadoop MapReduce is a software framework for writing applications that reliably process large amounts of data in parallel. A MapReduce application does two main things: a map step that turns input records into intermediate key/value pairs, and a reduce step that combines the values sharing a key into the final results. The MapReduce programming model was originally developed by Google; Hadoop MapReduce is Apache's open source implementation of it.
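The map and reduce steps can be sketched in a few lines of plain Python using the classic word-count example. This is a single-process illustration of the programming model only; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and input lines are assumptions made for this example. In a real cluster, the framework runs many map and reduce tasks in parallel and handles the grouping between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two short lines of text.
counts = word_count(["big data tools", "big data frameworks"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes the model suitable for large data sets.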