Data and ML

Amazon Managed Streaming for Apache Kafka (MSK)

Amazon MSK is a fully managed service for Apache Kafka.

It can be used as an alternative to Kinesis.

The default maximum message size is 1 MB, but it can be configured to allow larger messages.
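In Apache Kafka, the per-message limit is controlled by the broker property `message.max.bytes`, and `replica.fetch.max.bytes` is usually raised alongside it so replicas can still copy the larger messages. The sketch below renders those two properties as text that could go into an MSK custom configuration. The helper name and the 5 MB value are illustrative assumptions, and it is worth checking which properties your MSK version allows you to override.

```python
def render_broker_properties(max_message_bytes: int) -> str:
    """Render Kafka broker properties that raise the maximum message size.

    Hypothetical helper: it only builds the properties text and does not
    call any AWS API.
    """
    if max_message_bytes <= 0:
        raise ValueError("max_message_bytes must be positive")
    properties = {
        # Largest record batch the broker will accept.
        "message.max.bytes": max_message_bytes,
        # Replicas must be able to fetch messages of the same size.
        "replica.fetch.max.bytes": max_message_bytes,
    }
    return "\n".join(f"{key}={value}" for key, value in properties.items())


# Example: allow messages of up to 5 MB (an arbitrary illustrative value).
print(render_broker_properties(5 * 1024 * 1024))
```

Producers that send larger messages typically also need their client-side `max.request.size` raised.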

A serverless option, MSK Serverless, is available.

AWS Glue

AWS Glue is a managed service mainly used for ETL jobs. ETL stands for Extract, Transform and Load.

It is useful to prepare data for analytics.

It is a serverless service.

Glue Data Catalog runs Crawlers, which are connected to data sources like S3, RDS, DynamoDB, and JDBC.

A sample flow is:

S3 → Crawler → AWS Glue Data Catalog → AWS Glue Job → ETL → S3

It can convert data into Parquet format.

Glue job bookmarks can prevent re-processing old data. They allow you to save and track the data that has already been processed during a previous run of a Glue ETL job.
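The idea behind bookmarks can be illustrated with a small, purely conceptual sketch. This is not the Glue API: the function and the object keys are hypothetical, and a plain Python set stands in for the state that Glue persists between runs. In a real Glue job, bookmarks are enabled with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`.

```python
def run_job(available_keys, bookmark):
    """Process only the keys not yet recorded in the bookmark.

    `bookmark` is a set of already-processed keys; it is updated in place
    so that the next run skips everything handled here.
    """
    new_keys = sorted(key for key in available_keys if key not in bookmark)
    for key in new_keys:
        pass  # the transform-and-load step for this object would go here
    bookmark.update(new_keys)
    return new_keys


bookmark = set()  # stands in for state persisted between job runs
first = run_job(["2024/01.csv", "2024/02.csv"], bookmark)
second = run_job(["2024/01.csv", "2024/02.csv", "2024/03.csv"], bookmark)
print(first)   # both files are new on the first run
print(second)  # only the file added since the first run
```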

Glue Elastic Views combine and replicate data across multiple data stores using SQL. No custom code is needed, and Glue continuously monitors for changes in the source data.

Glue DataBrew is used to clean and normalize data using pre-built transformations, all from a GUI.

Glue Studio is a GUI to create, run and monitor ETL jobs in Glue.

Glue Streaming ETL is built on Apache Spark Structured Streaming. Instead of running ETL jobs as batch jobs, they can be run as streaming jobs. It is compatible with Kinesis Data Streams, Apache Kafka, and Amazon MSK.

Amazon EMR

Amazon EMR (previously called Elastic MapReduce) is a managed Hadoop framework.
It comes bundled with Apache Spark, HBase, Presto, and Flink.
It processes big data at scale.

Types of nodes:

  • Primary (master) nodes: manage the distribution of data

  • Core nodes: run tasks and store data

  • Task nodes: only run tasks
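The three node types map onto the `InstanceRole` values `MASTER`, `CORE`, and `TASK` in the instance group definitions that EMR accepts when a cluster is created. The sketch below only builds that list. The helper name, group names, instance type, and counts are illustrative assumptions, and nothing here calls AWS.

```python
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> list:
    """Build EMR-style instance group definitions for the three node types."""
    if core_count < 1:
        raise ValueError("a multi-node cluster needs at least one core node")
    groups = [
        # Primary node: coordinates the cluster.
        {"Name": "primary", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes: run tasks and store data.
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes: run tasks only, so they are optional.
        groups.append({"Name": "task", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": task_count})
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
print([group["InstanceRole"] for group in groups])
```

Because task nodes store no data, they are the usual candidates for scaling in and out with the workload.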

Amazon EMR Serverless can be implemented to automatically provision and scale resources based on workload demands.

AWS Lake Formation

AWS Lake Formation facilitates the aggregation of meaningful datasets for analytic purposes.
It is an easy way to set up a data lake in a matter of days.
It automates collecting, cleansing, moving, and cataloging data, and de-duplicates it using ML transforms.
The data lake is stored in S3.
It provides fine-grained access control, with column-level and row-level security.
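Column-level security can be pictured as a grant that names only the columns a principal may read. The sketch below builds a request in the shape of the Lake Formation `GrantPermissions` API (`Principal`, `Resource.TableWithColumns`, `Permissions`). The helper name, ARN, database, table, and column names are all hypothetical, and the dictionary is only constructed, not sent.

```python
def build_column_grant(principal_arn: str, database: str, table: str,
                       columns: list) -> dict:
    """Build a SELECT grant restricted to specific columns of one table."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role that may read two non-sensitive columns only.
grant = build_column_grant(
    "arn:aws:iam::123456789012:role/analyst",
    "sales_db",
    "orders",
    ["order_id", "order_total"],
)
print(grant["Resource"]["TableWithColumns"]["ColumnNames"])
```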

Amazon Athena

Amazon Athena is a serverless query service to analyze data stored in sources like S3.
It can pull data from data sources into a dashboard and a search query engine. A data analyst can use Athena to analyze this data.
It runs standard SQL queries against non-relational data sources (it is built on Presto).
It supports Federated Queries via Lambda functions.

The cost is $5 per TB of data scanned.
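Because pricing is based on data scanned, a rough estimate is simple arithmetic. The sketch below assumes 1 TB means 1024^4 bytes and ignores any per-query minimums or regional price differences. The function name and the example sizes are illustrative.

```python
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0     # the per-TB price quoted above


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned cannot be negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning a 200 GB dataset in full versus reading 20 GB of it.
full_scan = estimate_query_cost(200 * 1024 ** 3)
partial_scan = estimate_query_cost(20 * 1024 ** 3)
print(f"full scan: ${full_scan:.4f}, partial scan: ${partial_scan:.4f}")
```

This is why columnar formats such as Parquet and partitioned data tend to lower the bill: less data is scanned per query.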

Amazon QuickSight

Amazon QuickSight is used to build dashboards and reports from analytics data. It can process and handle massive amounts of data in real time.

It uses the SPICE engine - Super-fast, Parallel, In-memory, Calculation Engine.

QuickSight can integrate with third-party data sources like Salesforce and Jira. It can also connect to third-party databases and import files such as XLSX and CSV.

Amazon SageMaker

Amazon SageMaker is the primary machine learning service of AWS. It is a fully managed service to build, train, and deploy ML models.

The flow is as follows:
Historical data → label → calculate score → build ML model → train and tune.

Amazon Rekognition

Amazon Rekognition is used to find and recognize objects in images and videos.
It provides relevant and detailed tags on an object based on its content.
It can filter out inappropriate content.

Some of the popular use cases include:

  • Object and Scene detection

  • Facial Analysis and Recognition

  • Text in Image

  • Activity Detection

  • Unsafe content detection

  • Celebrity recognition

  • Custom Labels

  • Emotion Detection

  • Real time analysis
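For object and scene detection, Rekognition returns a list of labels, each with a name and a confidence score between 0 and 100. The sketch below filters a response-shaped dictionary by confidence. The helper name is hypothetical and the sample response is hand-written for illustration, not real service output.

```python
def labels_above(response: dict, min_confidence: float) -> list:
    """Return label names at or above the threshold, most confident first."""
    labels = [label for label in response.get("Labels", [])
              if label["Confidence"] >= min_confidence]
    labels.sort(key=lambda label: label["Confidence"], reverse=True)
    return [label["Name"] for label in labels]


# Hand-written sample in the shape of a label detection response.
sample_response = {
    "Labels": [
        {"Name": "Grass", "Confidence": 72.4},
        {"Name": "Dog", "Confidence": 98.1},
        {"Name": "Ball", "Confidence": 55.0},
    ]
}
print(labels_above(sample_response, 70.0))
```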

Amazon Polly

Amazon Polly is used to convert text into speech/voice. It supports the Speech Synthesis Markup Language (SSML), which lets you emphasize words, include breathing sounds, whisper, and more.
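As a rough illustration of SSML input, the sketch below assembles a document that uses the standard `<emphasis>` tag and Polly's `<amazon:effect name="whispered">` extension. The helper name and sample text are made up, and support for individual tags varies by voice and engine, so treat this as a starting point rather than a guaranteed-valid request.

```python
from xml.sax.saxutils import escape


def build_ssml(normal: str, emphasized: str, whispered: str) -> str:
    """Build an SSML document with one emphasized and one whispered phrase."""
    return (
        "<speak>"
        f"{escape(normal)} "
        f'<emphasis level="strong">{escape(emphasized)}</emphasis> '
        f'<amazon:effect name="whispered">{escape(whispered)}</amazon:effect>'
        "</speak>"
    )


ssml = build_ssml("The release is", "today,", "but keep it a secret.")
print(ssml)
```

The user-supplied text is escaped so that characters such as `&` and `<` cannot break the markup.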

Pronunciation lexicons help to customize the pronunciation of words.

Amazon Transcribe

Amazon Transcribe is used to convert speech/voice to text.

It can remove PII from transcripts using redaction.

It can provide features such as speaker identification.
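Both features are options on a transcription job. The sketch below builds a request in the shape of the `StartTranscriptionJob` API, with `ContentRedaction` for PII and `Settings` for speaker labels. The helper name, job name, S3 URI, and language code are placeholder assumptions, and the dictionary is only built, not submitted.

```python
def build_transcription_request(job_name: str, media_uri: str,
                                max_speakers: int = 2) -> dict:
    """Build a transcription job request with PII redaction and speaker labels."""
    if max_speakers < 2:
        raise ValueError("speaker identification needs at least two speakers")
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": "en-US",  # assumed language of the recording
        # Replace detected PII in the transcript and keep only the redacted version.
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        },
        # Label which speaker said what.
        "Settings": {
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": max_speakers,
        },
    }


request = build_transcription_request(
    "support-call-001", "s3://example-bucket/calls/001.wav"
)
print(request["ContentRedaction"], request["Settings"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as keyword arguments to the Transcribe client's `start_transcription_job` call.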

Amazon Translate

Amazon Translate provides fast, high-quality language translation between 75 languages, which gives over 5,000 language pairs.

Translate uses neural machine translation for a better quality translated output.

Amazon Textract

Amazon Textract is used to extract text, forms, tables and signatures from scanned documents.

Amazon Lex

Amazon Lex is used to build chatbots.
It works on the same technology that powers Alexa.

It uses Natural Language Understanding (NLU) to understand the context of the conversation.

Using Automatic Speech Recognition (ASR) it can interpret the input and convert speech to text.

Amazon Connect

Amazon Connect is a cloud-based virtual contact center that allows you to receive calls and create contact flows.
It can integrate with other CRM systems.
It can be up to 80% cheaper than traditional contact center solutions.

Amazon Comprehend

Amazon Comprehend is a text analysis service for Natural Language Processing (NLP).
It derives meaningful insights from the text input.
It can detect PII. A popular use case is to analyze product reviews to understand customer sentiment.
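For the product review use case, sentiment detection returns one of `POSITIVE`, `NEGATIVE`, `NEUTRAL`, or `MIXED` per text. The sketch below tallies those values across several response-shaped dictionaries. The helper name is hypothetical and the sample responses are hand-written for illustration, not real service output.

```python
from collections import Counter


def summarize_sentiments(responses: list) -> Counter:
    """Count how many reviews fall under each detected sentiment."""
    return Counter(response["Sentiment"] for response in responses)


# Hand-written samples in the shape of sentiment detection responses.
sample_responses = [
    {"Sentiment": "POSITIVE"},
    {"Sentiment": "NEGATIVE"},
    {"Sentiment": "POSITIVE"},
    {"Sentiment": "MIXED"},
]
summary = summarize_sentiments(sample_responses)
print(summary.most_common())
```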

Amazon Comprehend Medical

Amazon Comprehend Medical detects and returns useful information in unstructured clinical text.
Common use-cases include physician’s notes, discharge summaries, test results and case notes.

It uses NLP to detect Protected Health Information (PHI).

Amazon Forecast

Amazon Forecast is a service that uses historical data to produce forecasts.
An example is to read the historical sales in order to predict the future sales of a raincoat.
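To make the idea of forecasting from history concrete, the sketch below predicts the next period's sales as the average of the most recent periods. This is a deliberately naive moving average, not the algorithm Amazon Forecast uses, and the sales figures are invented for illustration.

```python
def moving_average_forecast(history: list, window: int = 3) -> float:
    """Predict the next value as the mean of the last `window` observations."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


# Invented monthly raincoat sales, oldest first.
raincoat_sales = [10, 20, 30, 40]
next_month = moving_average_forecast(raincoat_sales, window=3)
print(next_month)
```

A managed service adds what this sketch leaves out, such as seasonality, related datasets like weather, and model selection.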

Amazon Kendra

Amazon Kendra is a document search service powered by machine learning.

Amazon Personalize

Amazon Personalize is used to integrate real-time, personalized recommendations into applications.
The same technology is used by Amazon.com.

Amazon Augmented AI

Amazon Augmented AI is used to add human review to AI/ML predictions.

Amazon Fraud Detector

Amazon Fraud Detector is used to identify fraudulent activities across the ecosystem.
It helps to detect malicious activities.