Determine the Tools and Techniques Required for Analysis
Amazon Athena
- It is an interactive query service
- Easily analyze data in S3 using standard SQL.
- It is serverless
- No infrastructure to manage
- Pay only for the queries that you run.
Running Amazon Athena
- Point to data in S3
- Define the schema
- Start querying using standard SQL.
- Most results are delivered within seconds.
- No need for complex ETL jobs to prepare data for analysis.
- Anyone with SQL skills can quickly analyze large-scale datasets.
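The three steps above (point to data in S3, define the schema, query with SQL) can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the table name, columns, bucket paths, and helper function names are hypothetical, and the CSV layout is an assumption. Only the final function talks to AWS, through boto3's `start_query_execution` call, and it needs credentials to run.

```python
def create_table_ddl(table: str, s3_location: str) -> str:
    """Return DDL that defines a schema over CSV files already in S3.

    The column list is a made-up example for web access logs.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  request_time string,\n"
        "  status int,\n"
        "  bytes_sent bigint\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


def query_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above work offline

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

Because Athena charges per query, keeping the DDL and request builders separate from the submitting function makes it possible to inspect the SQL before anything is run.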
Athena integrates with the AWS Glue Data Catalog, which lets you
- Create a unified metadata repository across various services.
- Crawl data sources to discover schemas.
- Populate the Catalog with new and modified table and partition definitions.
- Maintain schema versioning.
- Use Glue’s ETL capabilities as well.
Amazon EMR
- It is a managed Hadoop framework
- Simplifies running big data frameworks such as Apache Hadoop, Apache Spark, HBase, Presto, and Flink on AWS.
- Process and analyze vast amounts of data.
- Supports Apache Hive and Apache Pig to process data for analytics and BI.
- Used to transform and move large amounts of data into and out of other AWS data stores and databases.
- Can interact with data in other AWS data stores such as S3 and DynamoDB.
EMR Notebooks
- Based on the popular Jupyter Notebook.
- Provide a development and collaboration environment for ad hoc querying and exploratory analysis.
Amazon CloudSearch
- It is a managed service
- Used to set up, manage, and scale a search solution for a website or application.
- Supports 34 languages
- Supported search features include highlighting, autocomplete, and geospatial search.
Amazon Elasticsearch Service
- Used to deploy, secure, operate, and scale Elasticsearch
- Elasticsearch is used to search, analyze, and visualize data in real-time.
- Provides access to the Elasticsearch APIs and real-time analytics capabilities.
- Useful for log analytics, full-text search, application monitoring, and clickstream analytics.
- Integrates with Kibana and Logstash.
- Integrates with other AWS services such as Amazon VPC, AWS KMS, Amazon Kinesis Data Firehose, AWS Lambda, AWS IAM, Amazon Cognito, and Amazon CloudWatch.
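As an illustration of the log analytics and full-text search use cases, the sketch below builds a request body in the Elasticsearch Query DSL. It combines a `match` full-text clause with a `range` filter on time. The field names `message` and `@timestamp` follow a common Logstash convention and are assumptions here, as is the helper name `error_search`; the body would be sent to an index's `_search` endpoint.

```python
import json


def error_search(text: str, since: str, size: int = 10) -> dict:
    """Build a search body: full-text match on log messages, newest first.

    `since` accepts Elasticsearch date math such as "now-15m".
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match against the log message field
                "must": [{"match": {"message": text}}],
                # Non-scoring filter that restricts the time window
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


body = error_search("connection timeout", "now-15m")
print(json.dumps(body, indent=2))
```

Putting the time window in `filter` rather than `must` keeps it out of relevance scoring, which is the usual choice for log queries.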
Amazon Kinesis
- Used to collect, process, and analyze real-time, streaming data
- Easily get timely insights and react quickly to new information.
- Offers flexibility to choose the tools that best suit the application's requirements.
- Ingests real-time data.
- Processes and analyzes data as it arrives, so you can respond immediately instead of waiting until all the data has been collected.
- Currently offers four services: Kinesis Data Firehose, Kinesis Data Analytics, Kinesis Data Streams, and Kinesis Video Streams.
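To make the ingestion side concrete, here is a minimal sketch of a producer writing one event to Kinesis Data Streams. The stream name, event shape, and helper names are hypothetical. The builder produces the arguments for boto3's `put_record` call (`StreamName`, `Data`, `PartitionKey`), and only `send` contacts AWS, which requires credentials and an existing stream.

```python
import json


def stream_record(stream_name: str, event: dict, key_field: str) -> dict:
    """Build put_record arguments for a single JSON event.

    Records with the same partition key land on the same shard, so the
    key is taken from a field of the event (for example a user ID).
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def send(record: dict) -> str:
    """Write the record to the stream; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so stream_record works offline

    kinesis = boto3.client("kinesis")
    response = kinesis.put_record(**record)
    return response["SequenceNumber"]
```

A clickstream producer, for instance, might call `send(stream_record("clicks", {"user_id": 42, "page": "/home"}, "user_id"))` for each page view.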
Amazon Redshift
- It is a fast, scalable data warehouse
- Used to analyze data across a data warehouse and a data lake.
- Uses machine learning, massively parallel query execution, and columnar storage for performance.
- Set up and deploy a new data warehouse in minutes.
- Run queries across petabytes of data in Redshift and exabytes of data in a data lake.
Amazon QuickSight
- It is a fast, cloud-powered business intelligence (BI) service
- Used to deliver insights to everyone in an organization.
- Create and publish interactive dashboards.
- Dashboards are accessible from browsers or mobile devices.
- Embed dashboards into applications for self-service analytics
- Easily scales without any software to install or infrastructure to manage.
AWS Data Pipeline
- It is a web service
- Used to reliably process and move data
- Moves data between different AWS services and on-premises data sources at specified intervals.
- Regularly access data where it’s stored, transform and process it at scale
- Transfer the results to AWS services.
- Easily create complex data processing workloads that are fault tolerant, repeatable, and highly available.
AWS Glue
- Fully managed ETL service
- Easily prepare and load data for analytics.
- Create and run an ETL job in the AWS Management Console.
- Point Glue to data stored on AWS; Glue discovers the data and stores the associated metadata in the Glue Data Catalog.
- Once cataloged, data is immediately searchable, queryable, and available for ETL.
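The "point Glue to data" step is typically done with a crawler. The sketch below shows one way to define and start one with boto3, assuming the data sits under an S3 prefix. The crawler name, IAM role ARN, database name, S3 path, and helper names are all hypothetical placeholders. `create_crawler` and `start_crawler` are the boto3 Glue client calls, and only `create_and_start` contacts AWS.

```python
def crawler_definition(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build create_crawler arguments for a crawler over one S3 prefix.

    Discovered tables are written to `database` in the Glue Data Catalog.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start(definition: dict) -> None:
    """Create the crawler and run it once; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so crawler_definition works offline

    glue = boto3.client("glue")
    glue.create_crawler(**definition)
    glue.start_crawler(Name=definition["Name"])
```

Once the crawler finishes, the tables it registers can be queried from services that read the Data Catalog, such as Athena.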