Python Data Mining: Concepts
Introduction
Big Data refers to data collections so large and complex that traditional database tools struggle to manage them. It is widely regarded as a foundation for the future of Information Technology (IT). Organizations increasingly depend on ever-larger volumes of data, which is why their interest in Big Data analytics keeps growing. The key to Big Data is organizing data for quick reference, so that the source records can be reached from summaries and indexes. Amazon AWS uses DDN with Lustre, Microsoft has been using Cray with Lustre, and Google uses FUSE or its own storage [1][2][3][4][5].
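The idea of reaching source records through an index can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any of the storage systems named above work: the function names `build_index` and `lookup` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Return the sorted ids of the documents that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample records; real sources would be files or database rows.
documents = {
    "doc1": "storage cluster for analytics",
    "doc2": "analytics on streaming data",
    "doc3": "backup storage policy",
}

index = build_index(documents)
print(lookup(index, "storage"))    # ['doc1', 'doc3']
print(lookup(index, "analytics"))  # ['doc1', 'doc2']
```

The index is built once over all the data, after which each lookup touches only the index entry rather than scanning every record.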
Big Data knowledge can help you craft the right plan or strategy and prepare you to compete in your industry. As in any other field, newcomers face some challenges. This article covers typical Big Data challenges that organizations face, along with their solutions.
Understanding
Many organizations neglect to learn the advantages and disadvantages of Big Data as a new technology in the market. They also fail to understand how important Big Data is for their business. Without reliable information, they form differing opinions: that it may be dangerous for the project, that it may be expensive, and so on.
You need to do proper research to understand the advantages and disadvantages of Big Data. Never accept or reject a technology without understanding its underlying concepts. To see how Big Data is received at different levels, attend workshops and other Big Data events. You can also contact partners who currently use the technology and benefit or profit from it. Big Data is a given, and it is a requirement for training in Artificial Intelligence and Deep Learning [6]. Deep Learning training needs as much data as possible, because part of the point of Deep Learning is to find patterns you might not otherwise see. If you are not doing Deep Learning, you need to process the data with other algorithms and try to keep up with the information as it comes in. Big Data processing is not done in real time. Instead, we train on the Big Data and use the result to find algorithms that we apply in real time, as in self-driving cars.
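The split between offline training and real-time use can be illustrated with a deliberately small sketch. It is not Deep Learning: a simple statistical threshold stands in for a trained model, and the function names, the threshold of 3.0 and the sample readings are all assumptions made for illustration.

```python
import statistics


def train(historical_values):
    """Offline step: summarize a historical batch as (mean, standard deviation)."""
    return statistics.mean(historical_values), statistics.pstdev(historical_values)


def is_anomaly(model, value, threshold=3.0):
    """Real-time step: flag a reading more than `threshold` deviations from the mean."""
    mean, stdev = model
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical historical batch; in practice this would be the large data set.
historical = [20.0, 21.0, 19.5, 20.5, 20.0, 19.0, 21.5, 20.5]
model = train(historical)

# Each incoming reading is checked against the small trained model only.
for reading in [20.2, 35.0, 19.8]:
    print(reading, is_anomaly(model, reading))
```

The expensive pass over the historical data happens once in `train`; the per-reading check in `is_anomaly` is cheap enough to run as data arrives.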
Concepts
Data structures should be established to better manage Big Data. They allow for the effective management and indexing of large data sets. In this context, data generally falls into one of two categories [7]:
Structured
Unstructured
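The difference between the two categories shows up as soon as you try to read the data. The following sketch uses only the Python standard library; the sample order records and the free-text message are hypothetical.

```python
import csv
import io
import re

# Structured: every record follows the same schema, so fields are addressed by name.
structured_source = "order_id,customer,amount\n1001,alice,25.50\n1002,bob,40.00\n"
rows = list(csv.DictReader(io.StringIO(structured_source)))
total = sum(float(row["amount"]) for row in rows)

# Unstructured: free text has no schema, so structure must be extracted,
# here with a regular expression that looks for order numbers.
unstructured_source = "Alice emailed to say order 1001 arrived late. Bob asked about order 1002."
order_ids = re.findall(r"order (\d+)", unstructured_source)

print(total)      # 65.5
print(order_ids)  # ['1001', '1002']
```

Structured data can be queried directly, while unstructured data first needs an extraction step whose rules depend on the content.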
By common definition, the attributes of Big Data are summarized as the "5Vs": Volume, Variety, Velocity, Value and Veracity. Keep in mind that this is a growing field [8][9].
The base definition is based on the three V’s: Variety, Volume and Velocity.
The importance of Big Data lies in the value added by measurable, reliable data. The modern notion of Big Data still follows the definition of very large, complex data, but it has recently been expanded to include two more V’s: Value and Veracity.
Because Big Data is constantly evolving, so are its main concepts. Our current understanding will also move beyond the 5 Vs as we further define what Big Data means. Some possible additions to the V’s are the following:
Security
Big Data involves integrating data across the various divisions of a business. Many organizations see Big Data as a threat when they share information with third-party software in order to make data visible to other departments. Big Data typically relies on large amounts of distributed backend storage, which different platforms do not support locally. Third-party software is meant only to view the data, yet it may be able to access the data for its own use.
As new technologies are introduced and Big Data is used in more ways, its security and confidentiality have become a concern. Big Data raises various security and privacy issues. The main issues in Big Data Security (BDS) are protecting and verifying data [10][11].
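One common way to verify data is to record a cryptographic digest when the data is stored and recompute it when the data is read back. The sketch below uses SHA-256 from Python's standard `hashlib` module; the helper names `digest` and `verify` and the sample records are hypothetical.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_digest: str) -> bool:
    """Check the data against a previously recorded digest."""
    return hmac.compare_digest(digest(data), expected_digest)


# Hypothetical stored record and a modified copy of it.
original = b"customer_id,balance\n42,100.00\n"
recorded = digest(original)
tampered = b"customer_id,balance\n42,900.00\n"

print(verify(original, recorded))  # True
print(verify(tampered, recorded))  # False
```

A plain digest only helps if it is stored where an attacker cannot rewrite it alongside the data. A keyed construction such as HMAC is the usual choice when that cannot be guaranteed.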
Because of the volume, velocity and variety of Big Data, processing it is challenging for conventional security models. This presents a challenge to security professionals, who must adapt to the massive scope of Big Data. The following table lists common threats to Big Data:
Threats | Description |
Breach of privacy | Big Data is a solution often used to store great volumes of personal information. Such a large store of data may make it easier for an attacker to steal sensitive personal information in one comprehensive attack. |
Privilege escalation | Because Big Data can represent wide swaths of information, some users may be able to view data that they are not authorized to view. This is especially true if systems are not in place to restrict how users can view and edit database entries. Multiple users with unrestricted visibility to data can threaten its confidentiality. |
Repudiation | The size of Big Data may make event monitoring difficult or infeasible. Without proper controls for non-repudiation, an attacker may be able to change data and then plausibly deny having done so. |
Forensic | Accurately securing, collecting, and evaluating Big Data sets is especially difficult because Big Data implementations often lack a consistent structure and draw on a variety of different sources. |
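The privilege escalation row mentions systems that restrict how users can view database entries. A minimal sketch of such a restriction is a per-role allow-list of fields. The roles, field names and the `view_record` helper below are hypothetical and stand in for whatever access-control mechanism a real platform provides.

```python
# Hypothetical roles and the fields each role is allowed to view.
VISIBLE_FIELDS = {
    "analyst": {"region", "purchase_total"},
    "support": {"name", "email"},
    "admin": {"name", "email", "region", "purchase_total"},
}


def view_record(record, role):
    """Return only the fields the role may see; unknown roles see nothing."""
    allowed = VISIBLE_FIELDS.get(role, set())
    return {field: value for field, value in record.items() if field in allowed}


record = {
    "name": "Alice",
    "email": "alice@example.com",
    "region": "EU",
    "purchase_total": 120.0,
}

print(view_record(record, "analyst"))  # {'region': 'EU', 'purchase_total': 120.0}
print(view_record(record, "guest"))    # {}
```

Denying by default, so that an unlisted role gets an empty view, keeps a misconfigured or unknown account from seeing data it was never granted.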
Cloud
Big Data is often described as a data warehouse where organizations can store huge amounts of data, and in many cases as a cloud-based storage space. Such a platform is expected to handle, clean, process and perform various other activities on the data. Today’s businesses hold massive amounts of data, and they are storing it in the cloud as Big Data.
Big Data is not the cloud. Big Data is large, fast and diverse data; the cloud is one tool that offers a solution for it. In-house computing, set up correctly, is effectively an internal cloud where the data is accessible only to people to whom you directly give access, internally. Truly sensitive data in a public cloud (such as AWS or Azure) is a major security concern: a foreign government, other companies and their contractors all have potential access to your data, and you have limited control [12].
Another challenge organizations face is the cost of storing Big Data. Most companies assume that Big Data will cost them far more than traditional data storage methods, but this is nothing more than a myth. The cost depends on your needs and requirements. Setting up internally requires hardware, software, maintenance and highly skilled people to set up and maintain the internal cloud. Cloud providers have economies of scale that they can use to their advantage in cost, scale, co-location and speed.
Example Use Cases
Organizations can quickly get lost in the wide range of Big Data technologies on the market, and the variety can make it hard to choose one for a business or project. If you explore this ocean with incomplete or partial knowledge, you will never have a clear view of what to expect from an application or a technology. For example, Big Data tools such as Google BigQuery and Apache Hadoop can be useful platforms for developing your own analysis tools. Third-party cloud-based apps also provide log analysis services.
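At a small scale, log analysis amounts to streaming over log lines and aggregating them. The sketch below uses only the Python standard library and does not call BigQuery or Hadoop. The log format (timestamp, level, message) and the `count_levels` helper are assumptions for illustration.

```python
from collections import Counter


def count_levels(lines):
    """Count log lines per severity level, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        # Assumed format: "<timestamp> <LEVEL> <message>"
        parts = line.split(maxsplit=2)
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts


# Hypothetical log lines; a real job would read them from files or a stream.
log_lines = [
    "2024-01-01T10:00:00 INFO service started",
    "2024-01-01T10:00:05 ERROR disk quota exceeded",
    "2024-01-01T10:00:09 INFO request served",
    "malformed",
]

counts = count_levels(log_lines)
print(counts.most_common())  # [('INFO', 2), ('ERROR', 1)]
```

Because `count_levels` consumes any iterable line by line, the same function works on a file object without loading the whole log into memory.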
Big Data in itself has no value; however, it has great potential. Big Data is used in every aspect of modern life, and we use information in everything we do. Since information is now easily accessed and shared, each person should be made aware of what their connection to Big Data looks like. Big Data can be used to solve efficiency problems by looking at how people and processes affect the overall workflow of the organization [13][14][15][16][17].
Conclusion
Big Data is regarded as a foundation for the future of Information Technology. Its goal is to automate multiple processes to help find value. Big Data has turned out to be one of the most promising and successful innovations for anticipating future patterns. It is advisable to do proper research and explore the technology as much as you can.
References:
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]