Discover Top Posts Tagged with #apachesparkdataframe

Lightning Engine: A New Era for Apache Spark Speed

Apache Spark analyses enormous data sets for ETL, data science, machine learning, and more. Scaled performance and cost efficiency may be issues. Users often experience resource utilisation, data I/O, and query execution bottlenecks, which slow processing and increase infrastructure costs.

Google Cloud knows these issues well. Lightning Engine (preview), the latest and most powerful Spark engine, unleashes your lakehouse's full potential and provides best-in-class Spark performance.

Lightning Engine?

Lightning Engine prioritises file-system layer and data-access connector optimisations as well as query and execution optimisations.

Lightning Engine enhances Spark query speed by 3.6x on TPC-H workloads at 10TB compared to open source Spark on equivalent equipment.

Lightning Engine's primary advancements are shown above:

Lightning Engine's Spark optimiser is improved by Google's F1 and Procella experience. This advanced optimiser includes adaptive query execution for join removal and exchange reuse, subquery fusion to consolidate scans, advanced inferred filters for semi-join pushdowns, dynamic in-filter generation for effective row-group pruning in Iceberg and Delta tables, optimising Bloom filters based on listing call statistics, and more. Scan and shuffle savings are significant when combined.

Lightning Engine's execution engine boosts performance with a native Apache Gluten and Velox implementation designed for Google's hardware. This uses unified memory management to switch between off-heap and on-heap memory without changing Spark settings. Lightning Engine now supports operators, functions, and Spark data types and can automatically detect when to use the native engine for pushdown results.

Lightning Engine employs columnar shuffle with an optimised serializer-deserializer to decrease shuffle data.

Lightning Engine uses a parquet parser for prefetching, caching, and in-filtering to reduce data scans and metadata operations.

Lightning Engine increases BigQuery and Google Cloud Storage connection to speed up its native engine. An optimised file output committer boosts Spark application performance and reliability, while the upgraded Cloud Storage connection reduces metadata operations to save money. By providing data directly to the engine in Apache Arrow format and eliminating row-to-columnar conversions, the new native BigQuery connection simplifies data delivery.

Lightning Engine works with SQL APIs and Apache Spark DataFrame, so workloads run seamlessly without code changes.

Lightning Engine—why?

Lightning Engine outperforms cloud Spark competitors and is cheaper. Open formats like Apache Iceberg and Delta Lake can boost business efficiency using BigQuery and Google Cloud's cutting-edge AI/ML.

Lightning Engine outperforms DIY Spark implementations, saving you money and letting you focus on your business challenges.

Advantages

Main lightning engine benefits

Faster query performance: Uses a new Spark processing engine with vectorised execution, intelligent caching, and optimised storage I/O.

Leading industry price-performance ratio: Allows customers to manage more data for less money by providing superior performance and cost effectiveness.

Intelligible Lakehouse integration: Integrates with Google Cloud services including BigQuery, Vertex AI, Apache Iceberg, and Delta Lake to provide a single data analytics and AI platform.

Optimised BigQuery and Cloud Storage connections increase data access latency, throughput, and metadata operations.

Flexible deployments: Cluster-based and serverless.

Lightning Engine boosts performance, although the impact depends on workload. It works well for compute-intensive Spark Dataframe API and Spark SQL queries, not I/O-bound tasks.

Spark's Google Cloud future

Google Cloud is excited to apply Google's size, performance, and technical prowess to Apache Spark workloads with the new Lightning Engine data query engine, enabling developers worldwide. It wants to speed it up in the following months, so this is just the start!

Google Cloud Serverless for Apache Spark and Dataproc on Google Compute Engine premium tiers demonstrate Lightning Engine. Both services offer GPU support for faster machine learning and task monitoring for operational efficiency.

#ApacheSpark #LightningEngine #BigQuery #CloudStorage #ApacheSparkDataFrame #Sparkengine #technology #technews #technologynews #news #govindhtech