🔧 AWS Kinesis Cheat Sheet
🔗 Source: dev.to
Below are the important pointers for the AWS Kinesis service, compiled as a cheat sheet for the AWS Certified Data Engineer - Associate exam.
Overview of AWS Kinesis
- AWS Kinesis
  - Kinesis Data Streams
    - Shards
    - Producers
    - Consumers
  - Kinesis Data Firehose
    - Delivery Streams
    - Destinations
    - Transformations
  - Kinesis Data Analytics
    - SQL Applications
    - Flink Applications
  - Kinesis Video Streams
    - Video Processing
    - Storage
Kinesis Services Comparison
| Feature | Kinesis Data Streams | Kinesis Data Firehose | Kinesis Data Analytics | Kinesis Video Streams |
|---|---|---|---|---|
| Purpose | Real-time streaming data collection and processing | Easy data delivery to destinations | Real-time analytics on streaming data | Capture, process, and store video streams |
| Retention | 24 hours (default), up to 365 days | No storage (immediate delivery) | No storage (processes in real time) | Up to years with configurable retention |
| Scaling | Manual shard provisioning | Automatic scaling | Automatic scaling | Automatic scaling |
| Processing | Custom processing | Optional Lambda transformation | SQL or Apache Flink | Video processing applications |
| Latency | Real-time (sub-second) | Near real-time (60 s buffer) | Real-time processing | Real-time video processing |
| Pricing | Per shard-hour + PUT payload units | Volume of data + optional transformation | Running time + processing resources | Data ingestion and storage |
| Replayability | Yes (within retention period) | No | No (unless source data available) | Yes (within retention period) |
Detailed Features and Specifications
Kinesis Data Streams
1. Kinesis Data Streams is a scalable and durable real-time data streaming service.
2. Data is organized in shards, each providing 1 MB/s input and 2 MB/s output capacity.
3. Default data retention is 24 hours, but can be extended up to 365 days for an additional cost.
4. Maximum size of a data record is 1 MB.
5. Supports two capacity modes: Provisioned (manual shard management) and On-Demand (automatic scaling).
6. Provisioned mode requires manual shard splitting and merging for scaling.
7. On-Demand mode automatically adjusts capacity based on observed throughput, accommodating up to 2x the previous peak.
8. Supports two types of consumers: shared (standard) and enhanced fan-out.
9. Standard consumers share 2 MB/s output per shard with all other standard consumers.
10. Enhanced fan-out consumers get dedicated 2 MB/s throughput per shard.
11. Supports multiple consumer applications reading from the same stream (fan-out pattern).
12. Data records are ordered by partition key within each shard.
13. Partition keys determine which shard a data record is written to (see the producer sketch after this list).
14. Sequence numbers are unique identifiers assigned to each record within a shard.
15. Supports at-least-once delivery semantics.
16. Supports resharding operations: shard splitting (increasing capacity) and shard merging (decreasing capacity).
17. Resharding doesn't affect existing data in the stream.
18. Producer throttling occurs when exceeding the per-shard limits of 1 MB/s or 1,000 records/s (ProvisionedThroughputExceededException).
19. Kinesis Producer Library (KPL) provides batching, compression, and retry mechanisms.
20. Kinesis Client Library (KCL) manages distributed consumption across multiple instances.
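A minimal producer sketch using boto3, tying together items 13 and 18-20 at the API level; the stream name `example-stream`, the region, and the `user_id` partition-key field are assumptions for illustration. `PutRecords` can partially fail, so only the failed entries are retried:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_events(events, stream_name="example-stream"):
    """Batch-write events; each record's partition key selects its shard."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # A high-cardinality key (e.g. a user ID) spreads load across shards.
            "PartitionKey": str(event["user_id"]),
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=records)
    # PutRecords is not all-or-nothing: throttled entries carry an ErrorCode
    # and should be retried (ideally with backoff).
    if response["FailedRecordCount"] > 0:
        failed = [
            records[i]
            for i, result in enumerate(response["Records"])
            if "ErrorCode" in result
        ]
        kinesis.put_records(StreamName=stream_name, Records=failed)
```

In production, the KPL (item 19) handles this batching, aggregation, and retry logic for you.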
Kinesis Data Firehose
21. Kinesis Data Firehose is a fully managed service for delivering streaming data to destinations.
22. Supports destinations: S3, Redshift, Elasticsearch/OpenSearch, Splunk, and custom HTTP endpoints.
23. Automatically scales to match throughput without provisioning.
24. Buffers incoming data based on buffer size (1-128 MB) or buffer interval (60-900 seconds).
25. Supports data transformation using AWS Lambda before delivery (see the transformation sketch after this list).
26. Can convert data format to Parquet or ORC for optimized storage in S3.
27. Supports compression formats: GZIP, ZIP, and Snappy.
28. Enables dynamic partitioning of data in S3 based on data attributes.
29. Supports error logging to S3 for failed deliveries or transformations.
30. Can be integrated with Kinesis Data Streams as a source.
31. Supports server-side encryption using AWS KMS.
32. Doesn't support data replay capabilities (delivers data once).
33. Supports batching for more efficient delivery to destinations.
34. Automatically retries failed deliveries based on retry duration (0-7200 seconds).
35. Supports backup of all or failed-only data to S3.
36. Default direct-PUT ingestion quotas are region-dependent soft limits, e.g. 2,000 requests/second and 5 MB/second per delivery stream (increases can be requested).
37. Supports record format conversion from JSON to Parquet/ORC.
38. Supports data transformation using Lambda with maximum timeout of 5 minutes.
39. Supports inline parsing of common log formats (Apache, Nginx, etc.).
40. Supports custom delimiters for record separation.
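A minimal sketch of the Lambda transformation contract from items 25 and 38 (the enrichment itself is a placeholder): Firehose invokes the function with a batch of base64-encoded records, and every output record must echo its `recordId` and declare a `result` of `Ok`, `Dropped`, or `ProcessingFailed`:

```python
import base64
import json

def lambda_handler(event, context):
    """Transform a Firehose batch; failed records go to the S3 error prefix."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # placeholder enrichment
        output.append({
            "recordId": record["recordId"],  # must match the input record
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```

Appending a newline per record keeps the resulting S3 objects line-delimited, which downstream tools such as Athena expect.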
Kinesis Data Analytics
41. Kinesis Data Analytics enables real-time processing of streaming data using SQL or Apache Flink.
42. Supports two types of applications: SQL applications and Flink applications.
43. SQL applications use standard SQL queries to process and analyze streaming data.
44. Flink applications use Apache Flink runtime for complex stream processing.
45. Automatically scales to handle varying data volumes.
46. Supports windowed operations (tumbling, sliding, and session windows); see the sketch after this list.
47. Can detect anomalies and generate alerts in real-time.
48. Supports joining streaming data with reference data from S3.
49. Provides built-in functions for time-series analytics.
50. Supports integration with AWS Lambda for custom processing.
51. Can read from Kinesis Data Streams or Firehose as sources.
52. Can output to Kinesis Data Streams, Firehose, or Lambda.
53. Supports schema discovery to automatically detect data structure.
54. Provides checkpointing for fault tolerance in Flink applications.
55. Supports exactly-once processing semantics with Flink.
56. Pricing based on running time and processing resources (KPUs).
57. Each Kinesis Processing Unit (KPU) provides 1 vCPU and 4 GB memory.
58. Minimum of 1 KPU required for SQL applications.
59. Supports in-application streams for intermediate processing steps.
60. Supports reference data from S3 for enriching streaming data.
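To make the windowing terminology in item 46 concrete, here is a plain-Python sketch of a tumbling (fixed-size, non-overlapping) window count; in Kinesis Data Analytics the same logic would be expressed in SQL or the Flink API rather than hand-rolled like this:

```python
from collections import defaultdict

def tumbling_counts(records, window_seconds=60):
    """Count records per fixed window; each record belongs to exactly one window."""
    windows = defaultdict(int)
    for ts, _payload in records:  # (epoch_seconds, payload) pairs
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += 1
    return dict(windows)

# tumbling_counts([(0, "a"), (59, "b"), (60, "c")]) -> {0: 2, 60: 1}
```

Sliding windows differ in that windows overlap (one record can fall into several), and session windows close after a configured gap of inactivity.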
Kinesis Video Streams
61. Kinesis Video Streams is a service for capturing, processing, and storing video streams.
62. Supports real-time and batch video processing.
63. Integrates with AWS ML services for video analysis.
64. Supports WebRTC for real-time communication.
65. Provides SDKs for various platforms (iOS, Android, embedded devices).
66. Supports HLS and MPEG-DASH for video playback (see the playback sketch after this list).
67. Enables long-term storage of video data.
68. Supports metadata tagging of video fragments.
69. Provides time-indexed access to stored video.
70. Supports encryption of video data at rest and in transit.
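A short boto3 sketch of HLS playback from item 66 (stream name and region are placeholders): the playback APIs live behind a per-stream data endpoint, so you fetch that endpoint first and then request a streaming session URL from it:

```python
import boto3

kvs = boto3.client("kinesisvideo", region_name="us-east-1")

# Step 1: look up the data endpoint that serves the HLS API for this stream.
endpoint = kvs.get_data_endpoint(
    StreamName="example-video-stream",
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

# Step 2: ask that endpoint for a time-limited HLS session URL.
archived = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)
url = archived.get_hls_streaming_session_url(
    StreamName="example-video-stream",
    PlaybackMode="LIVE",  # "ON_DEMAND" gives time-indexed access to stored video
)["HLSStreamingSessionURL"]
print(url)  # playable in any HLS-capable player
```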
Performance and Scaling
71. Kinesis Data Streams throughput is determined by the number of shards.
72. Each shard provides 1 MB/s or 1,000 records per second for writes.
73. Each shard provides 2 MB/s for reads (shared across standard consumers).
74. Enhanced fan-out consumers get dedicated 2 MB/s per shard.
75. Maximum record size is 1 MB, but keeping records under 50 KB is recommended for optimal performance.
76. Partition keys should be distributed to avoid "hot shards" (uneven distribution).
77. Calculation example: for 10 MB/s input and 20 MB/s of aggregate output, you need at least 10 shards (see the helper after this list).
78. On-Demand mode scales to 2x the peak observed throughput of the previous 30 days.
79. Kinesis Data Firehose can buffer data for up to 15 minutes before delivery.
80. Firehose throughput is limited to 5 MB/s per delivery stream by default (can request increase).
81. Kinesis Data Analytics SQL applications scale in increments of 4 KPUs.
82. Flink applications can scale based on parallelism configuration.
83. Resharding operations can take several minutes to complete.
84. Producer throttling can be mitigated using exponential backoff and jitter.
85. KPL aggregation can combine multiple records into a single Kinesis record to improve throughput.
86. Hot shard problem can be addressed by using a more diverse partition key.
87. Kinesis Agent can perform pre-aggregation to optimize throughput.
88. Implementing batching in producers can significantly improve throughput.
89. Using enhanced fan-out consumers reduces contention when multiple applications read from the same stream.
90. Kinesis Data Streams supports up to 5 transactions per second for control plane operations.
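A small helper formalizing item 77's calculation, using the per-shard limits from items 72-73 (1 MB/s and 1,000 records/s for writes, 2 MB/s of shared read throughput):

```python
import math

def required_shards(in_mb_s, out_mb_s, in_records_s=0):
    """Return the minimum shard count satisfying both write and read limits."""
    write_shards = max(math.ceil(in_mb_s / 1.0), math.ceil(in_records_s / 1000.0))
    read_shards = math.ceil(out_mb_s / 2.0)  # aggregate across standard consumers
    return max(write_shards, read_shards, 1)

print(required_shards(10, 20))  # item 77's example: max(10, 10) = 10 shards
```

Note that enhanced fan-out consumers do not count against the shared 2 MB/s read limit, so only standard consumers' aggregate reads belong in `out_mb_s`.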
Security and Compliance
91. Kinesis supports encryption at rest using AWS KMS (see the sketch after this list).
92. Supports encryption in transit using HTTPS endpoints.
93. IAM policies control access to Kinesis resources.
94. Supports VPC endpoints for enhanced security.
95. Compliant with SOC, PCI DSS, HIPAA, and other compliance programs.
96. Supports server-side encryption with customer master keys (CMKs).
97. Supports resource-based policies for cross-account access.
98. Supports AWS CloudTrail for API call logging.
99. Supports tagging for resource organization and cost allocation.
100. Supports AWS PrivateLink for private connectivity.
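A one-call sketch for items 91 and 96 (the stream name is a placeholder; the alias shown is the AWS-managed key, and a customer-managed CMK ARN works the same way):

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption on an existing stream; records are encrypted
# at rest from this point on (records written earlier remain unencrypted).
kinesis.start_stream_encryption(
    StreamName="example-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
```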
Monitoring and Troubleshooting
Important CloudWatch Metrics
| Service | Metric | Description | Threshold |
|---|---|---|---|
| Kinesis Data Streams | GetRecords.IteratorAgeMilliseconds | Age of the last record read (consumer lag) | < 30,000 ms (typical) |
| Kinesis Data Streams | WriteProvisionedThroughputExceeded | Write requests throttled due to exceeded shard limits | Should be near 0 |
| Kinesis Data Streams | ReadProvisionedThroughputExceeded | Read requests throttled | Should be near 0 |
| Kinesis Data Streams | PutRecord.Success | Successful write requests | Monitor for drops |
| Kinesis Data Streams | GetRecords.Success | Successful read requests | Monitor for drops |
| Kinesis Data Firehose | DeliveryToS3.Success | Successful deliveries to S3 | 100% |
| Kinesis Data Firehose | ThrottledRecords | Records throttled when writing to destination | Should be 0 |
| Kinesis Data Firehose | DeliveryToS3.DataFreshness | Age of the oldest record in Firehose | < buffer time + 60 s |
| Kinesis Data Analytics | millisBehindLatest | How far the application is behind the latest data | As low as possible |
| Kinesis Data Analytics | FullRestarts | Number of times the application has restarted | Should be 0 |
101. Monitor GetRecords.IteratorAgeMilliseconds to detect consumer processing delays.
102. High IncomingBytes with WriteProvisionedThroughputExceeded indicates need for more shards.
103. Monitor KPL's RecordsPerRequest metric to ensure efficient batching.
104. Use Enhanced Monitoring for per-shard metrics (additional cost).
105. CloudWatch Logs can capture Kinesis Data Firehose delivery errors.
106. Use CloudWatch Alarms to alert on high iterator age or throughput-exceeded metrics (see the alarm sketch after this list).
107. X-Ray can be used to trace end-to-end data flow through Kinesis services.
108. Monitor Kinesis Data Analytics with CloudWatch metrics for CPU utilization and memory usage.
109. Use CloudWatch Contributor Insights to identify top contributors to your streams.
110. Monitor Kinesis Data Streams shard-level metrics for identifying hot shards.
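A boto3 sketch for the alarm recommended in item 106, using the 30-second iterator-age threshold from the table above (stream name, SNS topic ARN, and account ID are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the slowest shard's consumer falls more than 30 s behind
# for three consecutive one-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="example-stream-iterator-age",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "example-stream"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=30_000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-alerts"],
)
```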
Data Ingestion Patterns and Replayability
111. Kinesis Data Streams supports replay of data within the retention period.
112. Replay is achieved by requesting a shard iterator at an older position, e.g. AT_TIMESTAMP (see the sketch after this list).
113. Replayability enables reprocessing data after fixing consumer bugs.
114. Kinesis Data Firehose doesn't support replay (one-time delivery).
115. For Firehose replayability, configure S3 backup and reprocess from S3.
116. Kinesis Data Analytics can replay by restarting the application from a specific timestamp.
117. Implementing the Lambda retry pattern with Kinesis can enhance replayability.
118. DynamoDB can store consumer checkpoints for controlled replay.
119. S3 can serve as a durable backup for long-term replayability beyond Kinesis retention.
120. Implementing the Command Query Responsibility Segregation (CQRS) pattern with Kinesis enables event sourcing.
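A replay sketch for items 111-112 (stream name, shard ID, and the `process` handler are placeholders): an `AT_TIMESTAMP` shard iterator starts reading from a past position within the retention period:

```python
from datetime import datetime, timedelta, timezone
import boto3

kinesis = boto3.client("kinesis")

def process(data: bytes) -> None:
    print(data)  # hypothetical reprocessing handler

# Start reading one shard from one hour ago.
iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.now(timezone.utc) - timedelta(hours=1),
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in batch["Records"]:
        process(record["Data"])
    if batch["MillisBehindLatest"] == 0:
        break  # caught up to the tip of the stream
    iterator = batch.get("NextShardIterator")
```

A real consumer would repeat this per shard (or let the KCL manage it) and checkpoint progress, e.g. in DynamoDB (item 118).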
Error Handling and Resilience
121. Implement exponential backoff for handling ProvisionedThroughputExceededException errors (see the retry sketch after this list).
122. Use KCL's checkpointing to ensure at-least-once processing semantics.
123. Implement dead-letter queues for records that fail processing repeatedly.
124. Configure Kinesis Data Firehose to backup failed deliveries to S3.
125. Use try/catch blocks in Lambda transformations to handle record-level failures.
126. Implement circuit breakers for downstream service failures.
127. Use CloudWatch Alarms to detect and alert on processing failures.
128. Implement idempotent consumers to handle duplicate records.
129. Use AWS Lambda destinations for asynchronous error handling.
130. Implement retry policies with exponential backoff and jitter.
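A minimal sketch combining items 121 and 130, using "full jitter" (sleep a random duration up to an exponentially growing cap); the stream name and base delay are assumptions:

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data, partition_key, stream="example-stream", max_retries=5):
    """Retry throttled writes with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName=stream, Data=data, PartitionKey=partition_key
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise  # only retry throttling; surface everything else
            time.sleep(random.uniform(0, 0.1 * (2 ** attempt)))
    raise RuntimeError("retries exhausted; route to a dead-letter queue (item 123)")
```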
Integration with Other AWS Services
131. Amazon S3 can be a destination for Kinesis Data Firehose.
132. Amazon Redshift can load data from Kinesis via Firehose.
133. Amazon OpenSearch Service can index streaming data from Kinesis.
134. AWS Lambda can process records from Kinesis streams (see the handler sketch after this list).
135. Amazon SQS can be used with Kinesis for buffering and retry logic.
136. Amazon SNS can be triggered by Kinesis Data Analytics alerts.
137. AWS Glue can catalog and query data delivered by Kinesis.
138. Amazon Athena can query data stored in S3 from Kinesis.
139. Amazon QuickSight can visualize analytics from Kinesis processed data.
140. AWS IoT Core can send device data to Kinesis streams.
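A minimal handler sketch for item 134: Lambda delivers Kinesis records in batches, with each payload base64-encoded under `record["kinesis"]["data"]`:

```python
import base64
import json

def lambda_handler(event, context):
    """Process a batch of Kinesis records delivered by the event source mapping."""
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            print(record["kinesis"]["partitionKey"], payload)
        except Exception:
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    # With ReportBatchItemFailures enabled on the event source mapping,
    # only the listed records are retried instead of the whole batch.
    return {"batchItemFailures": failures}
```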
Best Practices
141. Use appropriate partition keys to distribute data evenly across shards.
142. Implement proper error handling and retry logic in producers and consumers.
143. Use enhanced fan-out for high-throughput consumer applications.
144. Batch records in producers to improve throughput and reduce costs.
145. Monitor iterator age to detect consumer lag.
146. Use KCL for managing distributed consumption.
147. Implement proper exception handling for throttling events.
148. Use on-demand mode for unpredictable workloads.
149. Optimize record size to improve throughput.
150. Use dynamic partitioning in Firehose for cost-effective S3 storage.
Implementing Throttling and Overcoming Rate Limits
151. Implement client-side throttling to prevent server-side throttling.
152. Use exponential backoff with jitter for retry logic.
153. Implement a token bucket algorithm for rate limiting producers (see the sketch after this list).
154. Pre-aggregate data before sending to Kinesis to reduce record count.
155. Use KPL aggregation to combine multiple records into a single Kinesis record.
156. Implement circuit breakers to prevent overwhelming the service during issues.
157. Monitor throttling metrics and adjust shard count proactively.
158. Use adaptive batching based on current throughput and latency.
159. Implement priority queues for critical vs. non-critical data during throttling.
160. Request service quota increases for Firehose if consistently hitting limits.
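A compact token-bucket sketch for item 153 (the rates shown are illustrative; the real per-shard limits are 1,000 records/s and 1 MB/s):

```python
import time

class TokenBucket:
    """Client-side rate limiter: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, tokens=1):
        """Take tokens if available; return False to signal the caller to wait."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# e.g. cap a single-shard producer at ~1,000 records/s:
bucket = TokenBucket(rate=1000, capacity=1000)
```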
Throughput and Latency Characteristics
161. Kinesis Data Streams offers sub-second latency for real-time processing.
162. Kinesis Data Firehose has minimum latency of 60 seconds due to buffering.
163. Enhanced fan-out consumers have lower latency (70ms) compared to standard polling (200ms).
164. KPL batching increases throughput but adds 100-500ms of latency.
165. Kinesis Data Analytics SQL applications add 1-5 seconds of processing latency.
166. Flink applications can achieve lower latency than SQL applications.
167. Network latency impacts overall end-to-end latency in distributed systems.
168. Lambda processing adds variable latency based on function complexity.
169. S3 destination in Firehose adds variable latency based on object size.
170. Cross-region data transfer adds additional latency.
Open Source Components and Comparisons
171. Kinesis Data Analytics for Apache Flink uses open-source Apache Flink for stream processing.
172. Apache Flink provides more advanced stream processing capabilities than Kinesis SQL.
173. Flink supports exactly-once processing semantics, complex event processing, and stateful computations.
174. Kinesis Data Streams is similar to Apache Kafka but fully managed by AWS.
175. Kafka typically offers higher throughput but requires more operational management.
176. Kinesis has a simpler scaling model (shards) compared to Kafka's partitions and brokers.
177. Kinesis Data Firehose is comparable to Kafka Connect but with simpler integration to AWS services.
178. Kinesis Client Library (KCL) provides similar functionality to Kafka Consumer Groups.
179. Kinesis Producer Library (KPL) offers similar batching and aggregation as the Kafka Producer.
180. Flink in Kinesis Data Analytics supports the same APIs as open-source Flink but is fully managed.
Cost Optimization
181. Use On-Demand mode for unpredictable or low-volume workloads.
182. Use Provisioned mode for predictable, high-volume workloads.
183. Implement KPL aggregation to reduce the number of records (cost savings).
184. Choose appropriate retention period to balance cost and replayability needs.
185. Use enhanced fan-out only for consumers that need dedicated throughput.
186. Optimize record size to reduce PUT payload unit costs.
187. Use Kinesis Data Analytics only when real-time processing is required.
188. Consider S3 + Athena for lower-cost analytics on historical data.
189. Use dynamic partitioning in Firehose to optimize S3 storage costs.
190. Implement proper monitoring to avoid over-provisioning shards.
Exam Tips and Common Scenarios
191. Remember the throughput limits: 1 MB/s in, 2 MB/s out per shard.
192. Know the difference between standard consumers (shared 2 MB/s) and enhanced fan-out (dedicated 2 MB/s).
193. Understand when to use each Kinesis service based on requirements.
194. Know how to calculate the required number of shards based on throughput needs.
195. Understand the replayability capabilities of each Kinesis service.
196. Know the integration patterns between Kinesis and other AWS services.
197. Understand the monitoring metrics for detecting performance issues.
198. Know the difference between Provisioned and On-Demand capacity modes.
199. Understand how partition keys affect data distribution across shards.
200. Be familiar with common error scenarios and how to handle them (throttling, consumer lag, etc.).