Add new topics to MLOps roadmap

pull/5650/head
Kamran Ahmed 7 months ago
parent 304efd83b6
commit 43ece4c10f
  1. src/data/roadmaps/mlops/content/100-programming-fundamentals/101-golang.md (15 lines changed)
  2. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/100-data-pipelines/100-airflow.md (7 lines changed)
  3. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/100-data-pipelines/index.md (2 lines changed)
  4. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/102-spark-airflow-kafka.md (7 lines changed)
  5. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/102-spark.md (7 lines changed)
  6. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/103-data-ingestion-architecture.md (5 lines changed)
  7. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/103-kafka.md (8 lines changed)
  8. src/data/roadmaps/mlops/content/105-data-eng-fundamentals/104-flink.md (7 lines changed)
  9. src/data/roadmaps/mlops/mlops.json (1794 lines changed)

@@ -0,0 +1,15 @@
# Go
Go is an open-source programming language supported by Google. It can be used to write cloud services, CLI tools, APIs, and much more.
Visit the following resources to learn more:
- [Visit Dedicated Go Roadmap](/golang)
- [A Tour of Go – Go Basics](https://go.dev/tour/welcome/1)
- [Go Reference Documentation](https://go.dev/doc/)
- [Go by Example - annotated example programs](https://gobyexample.com/)
- [Learn Go | Codecademy](https://www.codecademy.com/learn/learn-go)
- [W3Schools Go Tutorial](https://www.w3schools.com/go/)
- [Making a RESTful JSON API in Go](https://thenewstack.io/make-a-restful-json-api-go/)
- [Go, the Programming Language of the Cloud](https://thenewstack.io/go-the-programming-language-of-the-cloud/)
- [Go Class by Matt](https://www.youtube.com/playlist?list=PLoILbKo9rG3skRCj37Kn5Zj803hhiuRK6)

@@ -0,0 +1,7 @@
# Airflow
Airflow is a platform to programmatically author, schedule, and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap, and the rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
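As a rough sketch of what "workflows as code" looks like, here is a minimal DAG using the TaskFlow API (assumes Airflow 2.4+; the task names, schedule, and logic are illustrative only):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # Hypothetical stand-in for pulling records from a real source.
        return [1, 2, 3]

    @task
    def transform(records):
        # Hypothetical business logic: double every value.
        return [r * 2 for r in records]

    @task
    def load(records):
        print(f"loaded {len(records)} records")

    # Passing outputs between tasks is what defines the DAG's dependencies.
    load(transform(extract()))


example_pipeline()
```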
Visit the following resources to learn more:
- [Airflow website](https://airflow.apache.org/)

@@ -1,3 +1,5 @@
# Data Pipelines
Data pipelines refer to a set of processes that involve moving data from one system to another, for purposes such as data integration, data migration, data transformation, or data synchronization. These processes can involve a variety of data sources and destinations, and may often require data to be cleaned, enriched, or otherwise transformed along the way. It's a key concept in data engineering to ensure that data is appropriately processed from its source to the location where it will be used, typically a data warehouse, data mart, or a data lake. As such, data pipelines play a crucial part in building an effective and efficient data analytics setup, enabling the flow of data to be processed for insights.
It is important to understand the difference between ELT and ETL pipelines. ELT stands for Extract, Load, Transform, and refers to a process where data is first extracted from source systems, then loaded into a target system, and finally transformed within the target system. ETL, on the other hand, stands for Extract, Transform, Load, and refers to a process where data is first extracted from source systems, then transformed, and finally loaded into a target system. The choice between ELT and ETL pipelines depends on the specific requirements of the data processing tasks at hand, and the capabilities of the systems involved.
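A minimal ETL sketch in plain Python may make the stages concrete (the file name, columns, and SQLite standing in as the "warehouse" are all illustrative assumptions):

```python
import csv
import sqlite3

# Extract: read raw rows from a hypothetical CSV export.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and enrich before loading. Transforming before the load
# makes this ETL; an ELT pipeline would load the raw rows first and
# transform them inside the target system.
cleaned = [(r["id"], r["amount"].strip(), r["country"].upper()) for r in rows]

# Load: write the transformed rows into the target store.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount TEXT, country TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
con.commit()
con.close()
```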

@@ -1,7 +0,0 @@
# Spark / Airflow / Kafka
Apache Spark is an open-source distributed computing system used for big data processing and analytics. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. On the other hand, Apache Airflow is an open-source platform to programmatically author, schedule and monitor workflows. The primary use case of Airflow is to define workflows of tasks that run at specific times or in response to specific events. Apache Kafka is a distributed event streaming platform that lets you publish, subscribe to, store, and process streams of records in real time. It is often used in situations where JMS (Java Message Service), RabbitMQ, and other messaging systems are found to be necessary but not powerful or flexible enough.
Visit the following resources to learn more:
- [Spark By Examples](https://sparkbyexamples.com)

@@ -0,0 +1,7 @@
# Spark
Apache Spark is an open-source distributed computing system used for big data processing and analytics. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
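A minimal PySpark sketch of that implicit parallelism (the data and aggregation are illustrative; assumes `pyspark` is installed):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available cores.
spark = SparkSession.builder.master("local[*]").appName("example").getOrCreate()

# A tiny DataFrame standing in for a real dataset; Spark distributes the
# aggregation across the cluster (here, local cores) automatically.
df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 2)], ["user", "events"]
)
df.groupBy("user").sum("events").show()

spark.stop()
```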
Visit the following resources to learn more:
- [Spark By Examples](https://sparkbyexamples.com)

@@ -0,0 +1,5 @@
# Data Ingestion Architectures
Data ingestion is the process of collecting, transferring, and loading data from various sources to a destination where it can be stored and analyzed. There are several data ingestion architectures that can be used to collect data from different sources and load it into a data warehouse, data lake, or other storage systems. These architectures can be broadly classified into two categories: batch processing and real-time processing. How you choose to ingest data will depend on the volume, velocity, and variety of data you are working with, as well as the latency requirements of your use case.
Lambda and Kappa are two popular architectures in this space: Lambda maintains separate batch and real-time (speed) layers and merges their results, while Kappa simplifies the design by treating all data as a stream and serving both historical and live workloads through a single stream-processing path.
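A toy sketch of the batch versus real-time distinction (the source and the load step are hypothetical stand-ins):

```python
def load(records):
    # Hypothetical stand-in for writing to a warehouse, lake, or topic.
    print(f"ingested {len(records)} records")

# Batch ingestion: accumulate a full extract, then load it all at once
# on a schedule (e.g. nightly).
def batch_ingest(source):
    load(list(source))

# Real-time ingestion: load each event as it arrives, trading the
# throughput of large batches for low end-to-end latency.
def stream_ingest(source):
    for event in source:
        load([event])

events = ["signup", "click", "purchase"]
batch_ingest(events)   # one load of 3 records
stream_ingest(events)  # three loads of 1 record each
```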

@@ -0,0 +1,8 @@
# Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
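A minimal publish/consume round trip, sketched with the third-party `kafka-python` client (assumes a broker at `localhost:9092` and a hypothetical `events` topic):

```python
from kafka import KafkaConsumer, KafkaProducer

# Produce a few events to the "events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", f"event-{i}".encode("utf-8"))
producer.flush()

# Consume them back; auto_offset_reset="earliest" replays from the start.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no message arrives for 5s
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```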
Visit the following resources to learn more:
- [Apache Kafka quickstart](https://kafka.apache.org/quickstart)
- [Apache Kafka Fundamentals](https://www.youtube.com/watch?v=B5j3uNBH8X4)

@@ -0,0 +1,7 @@
# Flink
Apache Flink is a distributed stream processing framework used to process large amounts of data in real time. It is designed to be highly scalable and fault-tolerant. Rather than being built on top of Kafka, Flink integrates closely with messaging systems such as Apache Kafka, which commonly act as sources and sinks for the streams it processes.
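A minimal PyFlink sketch (the in-memory collection is an illustrative stand-in for a real unbounded source such as a Kafka topic; assumes the `apache-flink` package is installed):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded in-memory collection standing in for an unbounded stream.
stream = env.from_collection(["click", "view", "click"])
stream.map(lambda event: f"processed:{event}").print()

# Flink jobs are lazy: nothing runs until execute() is called.
env.execute("minimal-flink-job")
```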
Visit the following resources to learn more:
- [Apache Flink Documentation](https://flink.apache.org/)

File diff suppressed because it is too large