The big data ecosystem evolves again: containerization becomes a major trend


There has been recent news in the big data ecosystem: Cisco combined its artificial intelligence hardware framework with a new deep learning server powered by eight GPUs. In a recent interview, Wikibon principal analyst James Kobielus said that Cisco is committed to supporting the development of Kubeflow in the field of artificial intelligence. "Kubeflow is an open source tool that makes TensorFlow compatible with the Kubernetes container orchestration engine."

TensorFlow is an open source software library for numerical computation. Its flexible architecture allows it to be deployed across a variety of platforms (CPUs, GPUs, TPUs) and devices (desktop computers, server clusters, and mobile and edge devices). TensorFlow was originally developed by the Google Brain team, part of Google's artificial intelligence organization. Its flexible numerical computing core makes it well suited to machine learning and deep learning, which is the workload Cisco's new eight-GPU deep learning server is built for.

James Kobielus believes that containerization is leading the software industry into a new era. Containerization is reshaping almost every information technology software platform, and it is having an impact on artificial intelligence and machine learning as well. For example, Cisco recently announced that it is improving the containerization of the TensorFlow stack. Kobielus said:

When I talk about highly complex AI, I mean something like TensorFlow. For example, suppose a user builds a deep learning model in TensorFlow that will be used in developing self-driving cars. The deep learning model will of course be deployed in the car itself, where it can use sensor data for object recognition and other functions. Within the car's control zone there will also be a deep learning model, which may deal with traffic congestion in a given area.

According to Kobielus, Apache Spark often runs together with the Hadoop Distributed File System (HDFS), which serves as its persistence or storage layer. Spark is memory-oriented and is one of the first choices as a machine learning development environment. It is increasingly used for real-time ETL and data preparation in hybrid deployments that also include TensorFlow, and it too is trending toward containerization.


Software containers enable companies to easily move workloads between different environments. Essentially, Kubeflow is a Kubernetes-based framework and toolset for building and training machine learning models. These models may be packaged in containers from the start. Some of the main topics in container research include Kubernetes orchestration, machine learning, and deep learning.

Across all application development, the containerization of DevOps workflows is quickly becoming the norm. Kobielus said this is especially true in the development of artificial intelligence applications. "Kubeflow enables DevOps to manage these applications end to end in a container orchestration environment." Kubeflow is becoming a key piece of glue in the smart device industry (including artificial intelligence devices) and supports the containerization of artificial intelligence. Azure's new machine learning offerings support container-based model management and development, as does Apache Spark.

He said that Kubeflow makes it possible to scale a machine learning model and then deploy it to production in the simplest possible way. Because machine learning researchers use different tools, the main goal is to let users customize the stack according to their needs and to provide an easy-to-use machine learning stack anywhere Kubernetes is already running.


Machine learning

Machine learning has developed into a form of data analysis used to identify patterns and predict probabilities, and it exists as a branch of artificial intelligence research. By feeding a model data with "known" answers, the computer can train itself to predict responses to unknown situations. Machine learning has achieved considerable success in solving specific tasks, and AI and ML are expected to be the main catalysts driving cloud computing. To work effectively, machine learning needs to be trained efficiently and combined with cloud technologies, including containerization.
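
To make the idea of training on data with "known" answers concrete, here is a minimal sketch in plain Python. It is not taken from the interview: the function names `fit_line` and `predict` and the training data are hypothetical, and ordinary least squares on a single feature stands in for a real machine learning model.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept to labeled examples by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to an input the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: inputs paired with their "known" answers.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)   # training step
prediction = predict(model, 10.0)    # prediction for an unseen input
```

The same split between a training step and a prediction step is what a containerized stack has to package and deploy, only with far larger models and data.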

With this in mind, Google developed Kubeflow, a portable, composable, and scalable machine learning stack built on Kubernetes. Kubeflow provides an open source platform for moving ML models by packaging them in containers, so that computation runs next to the data rather than in a superimposed layer. Kubeflow helps solve the basic problem of implementing an ML stack. Building production-grade machine learning solutions requires handling multiple data types, and assembling the stack from different tools can complicate the algorithms and produce inconsistent results.

Advantages of deep learning

Deep learning is a branch of machine learning in which computers use deep neural networks to "learn from experience" and to understand the world through a hierarchy of concepts. This hierarchy lets the computer handle complex concepts by building them up from simpler ones. Real-world organizations have combined machine learning and open source platform technologies in ways the original developers of these independent open source projects never expected. Kobielus said:

I think that in order to bring the cloud computing revolution to every device, deep learning and AI are very important and necessary. We have achieved comprehensive development in the field of mobile computing. AI technology will be applied to everyone and every machine, such as smart devices and autonomous devices.

Innovations of this kind have already appeared in the fields of face recognition and speech recognition. However, it needs to be carried out in a standardized way, or applied to the edge deployment environment through standardized cloud computing, that is, to achieve containerization and use Kubernetes. He continued:

As a developer, I think the key is to be able to package the models that perform different tasks, and to be able to connect these models together through orchestration, so that they can run together as components in a distributed application environment. In addition, this enables these models to be monitored and managed in real time, which is generally done through a control plane.
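
The idea in this quote, models packaged as components, chained in a fixed order, and monitored as they run, can be sketched in a few lines of Python. This is an illustrative toy only: `ModelComponent`, `Pipeline`, and the two stand-in "models" are hypothetical names, and nothing here uses the Kubernetes or Kubeflow APIs, which do this work across containers in a real deployment.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ModelComponent:
    """A packaged model: a name plus a callable that transforms its input."""
    name: str
    run: Callable[[Any], Any]


@dataclass
class Pipeline:
    """Runs components in order and records simple per-component metrics."""
    components: List[ModelComponent]
    metrics: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def execute(self, payload: Any) -> Any:
        for component in self.components:
            start = time.perf_counter()
            payload = component.run(payload)
            elapsed = time.perf_counter() - start
            stats = self.metrics.setdefault(
                component.name, {"calls": 0, "seconds": 0.0}
            )
            stats["calls"] += 1
            stats["seconds"] += elapsed
        return payload


def detect_objects(sensor_frame):
    # Stand-in for an object-recognition model: keep readings above 0.5.
    return [reading for reading in sensor_frame if reading > 0.5]


def count_objects(detections):
    # Stand-in for a downstream model that consumes the detections.
    return len(detections)


pipeline = Pipeline([
    ModelComponent("detector", detect_objects),
    ModelComponent("counter", count_objects),
])
result = pipeline.execute([0.1, 0.7, 0.9, 0.3])
```

After the call, `pipeline.metrics` holds a call count and cumulative run time for each component, a very small analogue of the monitoring that an orchestrator provides for containerized models.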

Eclipse and the Cloud Native Computing Foundation (CNCF) recently announced that they are collaborating to build a containerized open source stack and the tools needed to deploy deep learning and machine learning containers to edge devices. The Eclipse Foundation provides a business-friendly environment for open source software, innovation and collaboration.

A few months ago, the Eclipse Foundation initiated a project called Ditto, which was sponsored by Bosch. The focus of the project is to use digital twin technology to develop artificial intelligence, which is designed to run in a containerized manner on edge devices.


Data management

Data management is about managing and maintaining data and metadata assets. In the interview, Kobielus stated:

I like to use the word "management". The industry manages this stack at several levels. The community decides what is accepted as a project, what is submitted to a working group to build, and what finally graduates from the sandbox and is incubated under the community's management. There is also supervision by each supplier, as well as cloud supervision and server supervision.

Kobielus believes that this type of data management is a necessary part of this new era. Some things will be widely accepted and take on a life of their own. Others will be abandoned along the way, as happened in the early days of Hadoop. He said:

I remember some pieces of Hadoop, such as the Mahout machine learning library. Some have been adopted, but they have not yet reached the level of the Spark library.

He believes that data scientists are the core developers of artificial intelligence, but that they have not yet realized they need to know more about containers and about Kubernetes, "because this will appear in their toolbox, as the target environment to deploy their models."

He concluded by saying that data scientists, artificial intelligence developers, data architects, and others in the industry all need to understand how and why these new technologies are now a core component of their data stack. Everyone involved needs to understand this, or they will be left behind as the data age advances.
