Data and ML
Amazon Managed Streaming for Apache Kafka (MSK)
Amazon MSK is a fully managed service for Apache Kafka.
It can be used as an alternative to Kinesis.
The default maximum message size is 1 MB, but it can be configured to allow larger messages.
MSK Serverless is also available.
AWS Glue
AWS Glue is a managed service mainly used for ETL jobs. ETL stands for Extract, Transform and Load.
It is useful to prepare data for analytics.
It is a serverless service.
Glue Data Catalog
is a metadata store populated by Crawlers, which connect to data sources like S3, RDS, DynamoDB and JDBC.
A sample flow is:
S3 → Crawler → AWS Glue Data Catalog → AWS Glue Job → ETL → S3
It can convert data into Parquet format.
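The sample flow above can be sketched with boto3 roughly as follows. This is a minimal sketch under stated assumptions, not a definitive pipeline: the crawler and job names are hypothetical, the `glue` argument is expected to behave like `boto3.client("glue")`, and the crawler is assumed to be finished once its state returns to `READY`.

```python
import time


def run_catalog_then_etl(glue, crawler_name, job_name, poll_seconds=30):
    """Run a crawler, wait for it to finish, then start an ETL job.

    `glue` is assumed to behave like boto3.client("glue").
    `crawler_name` and `job_name` are hypothetical names chosen by the caller.
    """
    # The crawler scans the data source (e.g. S3) and updates the Data Catalog.
    glue.start_crawler(Name=crawler_name)

    # Poll until the crawler reports READY again (assumed to mean "finished").
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # The job reads the cataloged tables, transforms them and writes to S3.
    response = glue.start_job_run(JobName=job_name)
    return response["JobRunId"]
```

A caller would pass a real client, for example `run_catalog_then_etl(boto3.client("glue"), "my-crawler", "my-job")`. The conversion to Parquet itself happens inside the Glue job script, which is not shown here.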
Glue Elastic Views
combine and replicate data across multiple data stores using SQL. No custom code is needed; Glue continuously monitors the source data for changes.
Glue DataBrew
is used to clean and normalize data using pre-built transformations, all from a GUI.
Glue Studio
is a GUI to create, run and monitor ETL jobs in Glue.
Glue Streaming ETL
is built on Apache Spark Structured Streaming. ETL jobs can run as streaming jobs instead of batch jobs. It is compatible with Kinesis Data Streams, Apache Kafka and MSK.
Amazon EMR
Amazon EMR (previously called Elastic MapReduce) is a managed Hadoop framework.
It comes bundled with Apache Spark, HBase, Presto and Flink.
It processes big data at scale.
Types of nodes:
- Primary/Master nodes: manage the distribution of data
- Core nodes: run tasks and store data
- Task nodes: only run tasks
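The three node types map onto EMR instance groups with the roles `MASTER`, `CORE` and `TASK`. The helper below is a minimal sketch that only builds the configuration. The function name, group names and default instance type are illustrative assumptions.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Build an InstanceGroups list for an EMR cluster.

    The group names and the default instance type are illustrative only.
    """
    groups = [
        # Exactly one primary node coordinates the cluster.
        {"Name": "Primary", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and also store data.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional and only add compute capacity.
        groups.append({"Name": "Task", "InstanceRole": "TASK",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return groups
```

The result could be passed as `Instances={"InstanceGroups": groups}` to `boto3.client("emr").run_job_flow(...)`, together with the other required cluster settings.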
AWS Lake Formation
AWS Lake Formation facilitates the aggregation of meaningful datasets for analytic purposes.
It is an easy way to set up a data lake in a matter of days.
It automates collecting, cleansing, moving and cataloging data, and de-duplicates it using ML transforms.
The Data Lake is stored in S3.
It provides access control with column and row level security.
Fine-grained access control can be enabled.
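As an illustration of column-level security, the sketch below builds a request that grants `SELECT` on selected columns only. The function name and all argument values are hypothetical. Row-level filtering is not covered here.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build a request that grants SELECT on specific columns of one table.

    All argument values are supplied by the caller; none come from the notes.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become visible to the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }
```

The request could then be sent with `boto3.client("lakeformation").grant_permissions(**request)`.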
Amazon Athena
Amazon Athena is a serverless query service to analyze data stored in sources like S3.
It can pull data from data sources into a dashboard and a search query engine. A data analyst can use Athena to analyze this data.
It can run SQL queries on non-SQL data sources (it is built on Presto).
It can support Federated Queries via Lambda functions.
The cost is $5 per TB of data scanned.
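The pricing rule can be turned into a quick estimate. This is a rough sketch that assumes 1 TB = 1024^4 bytes and ignores any per-query minimum charge or rounding. The function name is illustrative.

```python
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0     # price quoted in these notes


def athena_scan_cost(bytes_scanned):
    """Estimate the cost in USD of a query that scanned `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

For example, a query that scans half a terabyte costs about $2.50 under these assumptions. Because the charge depends on data scanned, columnar formats such as Parquet usually reduce the cost.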
Amazon QuickSight
Amazon QuickSight is used to prepare pretty dashboards and reports of analytics data. It can process and handle massive amounts of data in real time.
It uses the SPICE engine: Super-fast, Parallel, In-memory Calculation Engine.
QuickSight can integrate with 3rd party data sources like Salesforce and Jira. It can integrate with 3rd party databases and can import files such as XLSX and CSV.
Amazon SageMaker
Amazon SageMaker is the primary machine learning service of AWS. It is a fully managed service to build, train and deploy ML models.
The flow is as follows:
Historical data → label → calculate score → build ML model → train and tune.
Amazon Rekognition
Amazon Rekognition is used to find and recognize objects in images and videos.
It provides relevant and detailed tags on an object based on its content.
It can filter out inappropriate content.
Some of the popular use cases include:
- Object and Scene detection
- Facial Analysis and Recognition
- Text in Image
- Activity Detection
- Unsafe content detection
- Celebrity recognition
- Custom Labels
- Emotion Detection
- Real time analysis
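Object and scene detection can be sketched with the `detect_labels` call. This is a minimal example, not a definitive implementation: the function name, bucket, key and confidence threshold are assumptions, and `rekognition` is expected to behave like `boto3.client("rekognition")`.

```python
def label_names(rekognition, bucket, key, min_confidence=80.0):
    """Return the names of labels detected in an image stored in S3.

    `bucket`, `key` and the confidence threshold are caller-supplied examples.
    """
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,
        MinConfidence=min_confidence,
    )
    # Each label carries a name and a confidence score; keep only the names.
    return [label["Name"] for label in response["Labels"]]
```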
Amazon Polly
Amazon Polly is used to convert text into speech/voice. It supports the Speech Synthesis Markup Language (SSML), which lets you emphasize words and add breathing sounds, whispering and more.
Pronunciation lexicons help to customize the pronunciation of words.
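To show what SSML input looks like, the sketch below builds a small SSML document with an emphasized phrase. The function name is illustrative, and support for individual tags may vary by voice.

```python
from xml.sax.saxutils import escape


def build_ssml(normal_text, emphasized_text):
    """Wrap text in <speak> and mark one phrase for strong emphasis."""
    return (
        "<speak>"
        # escape() protects characters such as & and < that are special in XML.
        f"{escape(normal_text)} "
        f'<emphasis level="strong">{escape(emphasized_text)}</emphasis>'
        "</speak>"
    )
```

The result could be passed to `boto3.client("polly").synthesize_speech(Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId="Joanna")`, where the voice is just an example choice.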
Amazon Transcribe
Amazon Transcribe is used to convert speech/voice to text.
It can remove PII from transcripts using automatic redaction.
It can provide features such as speaker identification.
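Redaction and speaker identification are both enabled through parameters of a transcription job. The helper below is a sketch that only builds those parameters. The function name, job name, media URI and language code are assumptions.

```python
def transcription_job_params(job_name, media_uri, max_speakers=2):
    """Build parameters for a job with PII redaction and speaker labels.

    `job_name` and `media_uri` are hypothetical values chosen by the caller.
    """
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": "en-US",  # assumption: US English audio
        # Replace detected PII in the transcript with a redaction marker.
        "ContentRedaction": {"RedactionType": "PII", "RedactionOutput": "redacted"},
        # Label which speaker said what, for up to `max_speakers` speakers.
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": max_speakers},
    }
```

The parameters could be passed as `boto3.client("transcribe").start_transcription_job(**params)`.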
Amazon Translate
Amazon Translate provides fast, high quality language translations across 75 languages (over 5,000 language pairs).
Translate uses neural machine translation for a better quality translated output.
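A translation request is a single call. In this minimal sketch the function name and default language codes are examples only, and `translate_client` is expected to behave like `boto3.client("translate")`.

```python
def translate(translate_client, text, source="en", target="es"):
    """Translate `text` from the source language to the target language."""
    response = translate_client.translate_text(
        Text=text,
        SourceLanguageCode=source,
        TargetLanguageCode=target,
    )
    return response["TranslatedText"]
```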
Amazon Textract
Amazon Textract is used to extract text, forms, tables and signatures from scanned documents.
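Plain text extraction can be sketched as follows. The function name, bucket and key are hypothetical, and `textract` is expected to behave like `boto3.client("textract")`. Forms, tables and signatures need the separate `analyze_document` call, which is not shown.

```python
def extract_lines(textract, bucket, key):
    """Return the lines of text detected in a scanned document stored in S3."""
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # The response contains PAGE, LINE and WORD blocks; keep whole lines only.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
```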
Amazon Lex
Amazon Lex is used to build chatbots.
It is built on the same technology that powers Alexa.
It uses Natural Language Understanding (NLU) to understand the context of the conversation.
Using Automatic Speech Recognition (ASR), it can interpret the input and convert speech to text.
Amazon Connect
Amazon Connect is a cloud-based virtual contact center that allows you to receive calls and create contact flows.
It can integrate with CRM systems.
AWS claims it can be up to 80% cheaper than traditional contact center solutions.
Amazon Comprehend
Amazon Comprehend is a text analysis service for Natural Language Processing (NLP).
It derives meaningful insights about the text input.
It can detect PII.
A popular use-case is to analyze product reviews, to understand the customer’s sentiments.
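The product review use-case can be sketched with the `detect_sentiment` call. This is a minimal example: the function name and language code are assumptions, and `comprehend` is expected to behave like `boto3.client("comprehend")`.

```python
def review_sentiments(comprehend, reviews):
    """Pair each review with the overall sentiment detected for it."""
    results = []
    for review in reviews:
        response = comprehend.detect_sentiment(Text=review, LanguageCode="en")
        # Sentiment is one of POSITIVE, NEGATIVE, NEUTRAL or MIXED.
        results.append((review, response["Sentiment"]))
    return results
```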
Amazon Comprehend Medical
Amazon Comprehend Medical detects and returns useful information in unstructured clinical text.
Common inputs include physician’s notes, discharge summaries, test results and case notes.
It uses NLP to detect Protected Health Information (PHI).
Amazon Forecast
Amazon Forecast is a service that uses historical data to forecast future values.
An example is to read the historical sales in order to predict the future sales of a raincoat.
Amazon Personalize
Amazon Personalize is used to integrate real-time, personalized recommendations into applications.
The same technology is used by Amazon.com.
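Fetching recommendations for a user can be sketched as below. The function name, campaign ARN and user ID are hypothetical, and `personalize_runtime` is expected to behave like `boto3.client("personalize-runtime")` with a campaign that has already been trained and deployed.

```python
def recommended_item_ids(personalize_runtime, campaign_arn, user_id, count=5):
    """Return the IDs of the top recommended items for one user."""
    response = personalize_runtime.get_recommendations(
        campaignArn=campaign_arn,
        userId=user_id,
        numResults=count,
    )
    # Each entry in itemList describes one recommended item.
    return [item["itemId"] for item in response["itemList"]]
```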