Making Data Pipelines Easy with Dataproc and Composer | DataHour by Julian Sara Joseph
In this DataHour, Julian will walk you through best practices for data engineering teams and discuss why and how to use …
Built-In Spark UI: Real-Time Job Tracking For Spark Batches
Dataproc Serverless: faster, simpler, and smarter. Google Cloud has added new features that further improve the speed, ease of use, and intelligence of Dataproc Serverless.
Elevate your Spark experience with:
Native query execution: Take advantage of the new native query execution in the Premium tier to see significant speed improvements.
Seamless monitoring with Spark UI: With a built-in Spark UI that is available by default for all Spark batches and sessions, you can monitor job progress in real time.
Streamlined investigation: Troubleshoot batch jobs from a single “Investigate” page that automatically filters logs by errors and highlights all the important metrics.
Proactive autotuning and assisted troubleshooting with Gemini: Let Gemini reduce failures and tune performance by analyzing past runs. Use Gemini-powered insights and recommendations to resolve problems quickly.
Accelerate your Spark jobs with native query execution
By enabling native query execution, you can significantly increase the performance of your Spark batch jobs in the Premium tier on Dataproc Serverless runtimes 2.2.26+ or 1.2.26+ without making any changes to your application.
In the experiments using queries taken from the TPC-DS and TPC-H benchmarks, this new functionality in the Dataproc Serverless Premium tier increased the query performance by around 47%.
The performance results are based on 1 TB of Parquet data in GCS and queries derived from the TPC-DS and TPC-H standards. Because these runs do not meet all the requirements of the TPC-DS and TPC-H specifications, they cannot be compared to published TPC-DS and TPC-H results.
Use the native query execution qualification tool to get started right away. It makes it simple to find jobs that qualify and to estimate the possible performance gains. Once you have identified the batch jobs that are suited to native query execution, you can enable it to speed up those jobs and potentially save money.
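The eligibility rule above (Premium tier, runtime 2.2.26+ or 1.2.26+) can be sketched as a small helper that decides whether to add native-execution settings to a batch. This is a minimal sketch, not the qualification tool itself. The Spark property names are an assumption recalled from the Dataproc Serverless documentation rather than taken from this post, so verify them before use; the helper function names are hypothetical.

```python
import re

# Minimum patch level per runtime line, taken from the versions quoted above.
MIN_NATIVE_PATCH = {(2, 2): 26, (1, 2): 26}


def supports_native_execution(runtime_version: str) -> bool:
    """Return True for a 1.2.x or 2.2.x runtime at patch level 26 or later.

    ASSUMPTION: only the two runtime lines named in the post qualify.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", runtime_version.strip())
    if match is None:
        return False
    major, minor, patch = (int(part) for part in match.groups())
    minimum = MIN_NATIVE_PATCH.get((major, minor))
    return minimum is not None and patch >= minimum


def native_execution_properties() -> dict:
    """Spark properties for a Premium-tier batch with native execution.

    ASSUMPTION: property names are recalled from the Dataproc Serverless
    documentation, not from this post; check them against current docs.
    """
    return {
        "spark.dataproc.runtimeEngine": "native",
        "spark.dataproc.driver.compute.tier": "premium",
        "spark.dataproc.executor.compute.tier": "premium",
    }


def build_batch_properties(runtime_version: str, base: dict) -> dict:
    """Merge native-execution properties into `base` only when eligible."""
    merged = dict(base)
    if supports_native_execution(runtime_version):
        merged.update(native_execution_properties())
    return merged
```

The resulting dictionary is meant to be passed as the Spark properties of a batch submission; the application code itself stays unchanged, as the post describes.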
Seamless monitoring with Spark UI
Are you tired of struggling to set up and manage persistent history server (PHS) clusters just to debug your Spark batches? Wouldn’t it be simpler to see the Spark UI in real time without having to pay for the history server?
Until recently, tracking and debugging Spark jobs in Dataproc Serverless required setting up and maintaining a separate Spark persistent history server. Importantly, the history server had to be configured for every batch run. Otherwise, the batch job could not be analyzed in the open-source UI. In addition, switching between applications in the open-source UI was slow.
Google Cloud has clearly heard you. It now offers a fully managed Spark UI in Dataproc Serverless, which simplifies monitoring and troubleshooting.
In both the Standard and Premium tiers of Dataproc Serverless, the Spark UI is built in and available immediately for every batch job and session at no extra cost. Just submit your job, and you can begin using the Spark UI to analyze performance in real time.
Accessing the Spark UI
The “VIEW SPARK UI” link is located in the upper right corner.
With detailed insights into your Spark job performance, the new Spark UI offers the same robust functionality as the open-source Spark History Server. Browse active and finished applications with ease, investigate jobs, stages, and tasks, and examine SQL queries to have a thorough grasp of how your application is being executed. Use thorough execution information to diagnose problems and identify bottlenecks quickly.
The “Executors” page offers direct links to the relevant logs in Cloud Logging for deeper investigation, so you can look into problems with specific executors right away.
If you have previously set up a Persistent Spark History Server, you may still see it by clicking the “VIEW SPARK HISTORY SERVER” link.
Streamlined investigation (Preview)
With the new “Investigate” tab on the Batch details page, you get immediate diagnostic highlights gathered in one place.
The “Metrics highlights” section automatically shows the key metrics, giving you a comprehensive view of the state of your batch job. If you need more metrics, you can create a custom dashboard.
Below the metrics highlights, a “Job Logs” widget displays the logs filtered by errors, so you can quickly identify and fix issues.
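The same "logs filtered by errors" view can be approximated outside the console with a Cloud Logging query. The sketch below only builds the filter string. The resource type `cloud_dataproc_batch` and its `batch_id` label are assumptions recalled from Cloud Logging's monitored-resource list, not from this post, and the helper name is hypothetical.

```python
def batch_error_log_filter(batch_id: str, min_severity: str = "ERROR") -> str:
    """Build a Cloud Logging filter for one batch's error-level log entries.

    ASSUMPTION: Dataproc Serverless batches log under the monitored
    resource type "cloud_dataproc_batch" with a "batch_id" label.
    """
    clauses = [
        'resource.type="cloud_dataproc_batch"',
        f'resource.labels.batch_id="{batch_id}"',
        f"severity>={min_severity}",
    ]
    return " AND ".join(clauses)
```

The returned string can be pasted into the Logs Explorer query box or passed to a logging client, which keeps the error-first view available in scripts and alerts.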
Proactive autotuning and assisted troubleshooting with Gemini (Preview)
Finally, Gemini in BigQuery can help simplify the process of tuning hundreds of Spark properties when you submit your batch job configurations. If a job fails or runs slowly, Gemini can save you from wading through gigabytes of logs to debug it.
Enhance performance: Gemini can automatically tune the Spark configurations of your Dataproc Serverless batch jobs for optimal reliability and performance.
Simplify troubleshooting: By selecting “Ask Gemini” for AI-powered analysis and guidance, you can quickly identify and fix problems with slow or failed jobs.
Dataproc Metastore (DPMS) Setup patterns On Google Cloud
Big data professionals are probably already familiar with Apache Hive and the Hive Metastore, which has evolved into the industry standard for handling metadata. Running on Google Cloud, Dataproc Metastore is a fully managed Apache Hive metastore (HMS). Dataproc Metastore is serverless, self-healing, auto-scaling, and highly available. All of this facilitates interoperability between different data processing engines and whatever tools you may be utilising, and it helps you manage your metadata and data lake.
If you are migrating from an on-premises Hadoop setup with several Hive Metastores to Dataproc Metastore on Google Cloud, you might be looking for ways to organise your Dataproc Metastores (DPMS) efficiently. Three key considerations apply when designing a DPMS architecture: centralisation vs. decentralisation, single-region vs. multi-region, and persistence vs. federation. These design choices can have a big effect on how manageable, resilient, and scalable your metadata is.
Four patterns of DPMS deployment are examined in this blog post:
A single centralised multi-regional DPMS
Centralised metadata federation with per-domain DPMS
Decentralised metadata federation with per-domain DPMS
Ephemeral metadata federation
Each of these patterns has its own benefits, to help you choose the one that best suits your organisation's requirements. The patterns are presented in order of increasing complexity and maturity, so you can select the one that fits your organisation's DPMS needs and usage.
Note: For the purposes of this blog post, a domain is a department, business unit, or functional area within your organisation. Each domain may have its own requirements, data processing needs, and metadata management practices.
Let’s examine each of these patterns in more detail.
1. A single centralised multi-regional Dataproc Metastore
This design works well for smaller use cases, when you have few domains and can combine all metastores into a single multi-regional (MR) Dataproc Metastore.
In this approach, the metastores from all domains are combined into a single multi-regional DPMS, which is deployed in one shared project. With this configuration, all of the organisation's domain projects can access the metadata in the centralised DPMS. The main goal of this design is to provide a clear and manageable solution for organisations with a small number of domains and a relatively simple use case.
When you create a Dataproc Metastore service, you designate a region, a geographical area where your service will always be located. You can choose a single region or a multi-region. A multi-region is a large geographic area that encompasses two or more geographic locations and offers greater availability. Multi-regional Dataproc Metastore services store your data in two different regions and run your workloads in both. For instance, the multi-region nam7 contains the us-central1 and us-east4 regions.
Benefits of this layout:
You may lessen the complexity of your data environment and streamline metadata administration by combining several metastores into a single DPMS.
Controlling access and permissions gets easier.
2. Per-domain DPMS and centralised metadata federation
When you have several domains, each with its own DPMS, and it is not practical to combine them into a single metastore, you can use this slightly more advanced approach. In these situations, you can use a fundamental building block called metadata federation to enable collaboration and metadata sharing between domains.
Metadata federation is a service that lets users access metadata from several sources through a single endpoint. At the time this blog post was written, these sources include Dataproc Metastore, BigQuery datasets, and Dataplex lakes. The federation service exposes this endpoint over the gRPC protocol. To serve a request, it checks the source ordering across metastores to retrieve the necessary metadata. gRPC is a popular choice for building distributed systems because of its high performance.
To set up federation, create a federation service and then configure your metadata sources. The service then exposes a single gRPC endpoint through which all of your metadata is accessible. In this design, each domain is responsible for owning and operating its own Dataproc Metastores.
The metastore federation, which combines the BigQuery and DPMS resources from each domain, is hosted by a central project. Teams can work independently, create data pipelines, and access metadata with this configuration. Teams can use the federation service to retrieve information and data from other domains as needed.
Among this design’s benefits are:
Per-domain DPMS: By giving each domain its own Dataproc Metastore, management and access control are made easier by clearly defining the boundaries for metadata and data access.
Centralised metastore federation: This system gives users a single, easily-accessible view of all metadata from all domains, giving them a thorough understanding of the ecosystem as a whole.
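The "source ordering" behaviour described above can be illustrated with a small in-memory model. This is only a sketch of the idea, not the federation service or its client library: the class names and source labels are hypothetical, and the rule that a lower rank is consulted first is an assumption based on how the Dataproc Metastore federation API describes its ranked backend metastores.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataSource:
    name: str  # hypothetical label, e.g. "sales-dpms"
    kind: str  # e.g. "DATAPROC_METASTORE", "BIGQUERY" or "DATAPLEX"
    databases: set = field(default_factory=set)


class Federation:
    """Toy model of a federation endpoint backed by ranked sources."""

    def __init__(self, ranked_sources: dict):
        # Keys are integer ranks; a lower rank is consulted first (assumption).
        self._ordered = [ranked_sources[rank] for rank in sorted(ranked_sources)]

    def resolve(self, database: str) -> Optional[MetadataSource]:
        """Return the first source, in rank order, that defines `database`."""
        for source in self._ordered:
            if database in source.databases:
                return source
        return None
```

For example, if a sales DPMS at rank 1 and a BigQuery source at rank 2 both define a database called `shared`, `resolve("shared")` returns the sales DPMS, which is why the order of sources matters when names overlap.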
3. Per-domain DPMS in a decentralised metadata federation
You use this more advanced approach when a domain contains several DPMS instances, some single-region and some multi-region. You want each team within a domain to own and administer its own DPMS, but you also want a metadata federation that connects all DPMS instances inside that domain so that the domain's metastores can be used together.
Each domain in this design is in charge of managing its own Dataproc Metastores, which could be made up of many separate DPMS instances or a single, integrated MR DPMS. Within each domain, a Metastore federation is created to link Dataplex lakes, BigQuery, and one or more DPMS installations. Expanding upon the concept of metadata federation discussed in the centralised metadata federation section above, this federation service can also integrate metadata (DPMS, BigQuery, lakes) from other domains as needed.
Among this design’s benefits are:
When a DPMS fails unexpectedly, the impact is far smaller than with a single MR DPMS.
The latency of searching multiple DPMS instances through federation is minimised, because only relevant DPMS instances are included in the federation, and the order in which they are stitched together determines both the metadata search order and the priority when names collide.
Namespace conflicts are reduced, because only local metastores and those required for ETL are included in the federation.
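Before stitching metastores into a federation, it can help to see which database names would collide. The helper below is a hypothetical pre-check over plain name lists, not part of any Google Cloud API; it reports every database name defined in more than one metastore.

```python
from collections import defaultdict


def find_database_collisions(metastores: dict) -> dict:
    """Map each database name found in more than one metastore to the
    sorted list of metastores that define it.

    `metastores` maps a metastore label to an iterable of database names.
    """
    owners = defaultdict(list)
    for metastore_name, databases in metastores.items():
        for database in set(databases):  # ignore duplicates within one metastore
            owners[database].append(metastore_name)
    return {
        database: sorted(names)
        for database, names in owners.items()
        if len(names) > 1
    }
```

Any name this reports is resolved by federation order, so a team can either rename the database or make the ordering choice deliberately.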
4. Federated ephemeral metadata
Building on the previous pattern, which covered metadata federation within a domain, we can extend the idea to ephemeral federation across domains. This design is especially helpful when you have ETL jobs that need temporary access to metadata from several DPMS instances across different projects or domains.
This architecture uses ephemeral federation to stitch metastores together dynamically for ETL. When ETL jobs need access to more metadata than is available in the domain's DPMS or BigQuery, you can set up a temporary federation with DPMS instances from other projects. This temporary federation lets the ETL jobs obtain the required metadata from the additional DPMS instances. Once again, metastore federation is the building block.
A major benefit of the ephemeral federation approach is the flexibility to specify and stitch together different DPMS instances for each ETL job or workflow as needed. The federation can be restricted to only the metastores that are required, instead of relying on a static, broader federation setup. The temporary federation configuration can be orchestrated in an Airflow DAG, alongside the creation of the Dataproc cluster. This means that provisioning and tearing down the ephemeral federation can be fully automated for the duration of the ETL jobs.
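The provision-then-teardown lifecycle described above can be sketched as a context manager. This is a minimal sketch under stated assumptions: `create` and `delete` are hypothetical callables standing in for whatever actually provisions and removes the federation (for example, tasks in an Airflow DAG), and no real Google Cloud client is used here.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


@contextmanager
def ephemeral_federation(
    create: Callable[[Sequence[str]], str],
    delete: Callable[[str], None],
    metastores: Sequence[str],
) -> Iterator[str]:
    """Create a federation over `metastores`, yield its id, always delete it.

    `create` and `delete` are hypothetical hooks supplied by the caller.
    """
    federation_id = create(metastores)
    try:
        yield federation_id
    finally:
        # Teardown runs even if the ETL step inside the `with` block fails.
        delete(federation_id)
```

Wrapping the ETL step in `with ephemeral_federation(...) as federation_id:` keeps the federation scoped to the job, which mirrors the setup and teardown tasks that would bracket the ETL tasks in a DAG.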
In summary
To match your infrastructure to your organisation's objectives, it is essential to understand the advantages and disadvantages of each DPMS deployment pattern. Consider the following factors when choosing a design pattern:
Evaluate the complexity of your data environment, taking into account the number of teams, domains, and data processing needs.
Determine whether your organisation needs cross-domain metadata sharing and collaboration.
Consider how important data autonomy is and how much control over metadata each domain needs.
Decide on the right balance between flexibility and simplicity in your metadata management architecture.
By weighing these factors carefully and understanding the trade-offs between the design patterns, you can make an informed choice that supports successful metadata management at scale. These factors will help you find the right balance between simplicity, scalability, collaboration, and resilience.
PDE-11 Quick, GCP Professional Data Engineer - Apache Spark, SQL, Dataproc, BigQuery, Dataflow
“You are designing storage for CSV files and using an I/O-intensive custom Apache Spark transform as part of deploying a data …
"I had some dreams, they were clouds ☁️ in my coffee☕️ ... Clouds ☁️ in my coffee ☕️ , and..." Hi All - Last week, we explored Google's fully managed "No-Ops" Cloud ☁️ DW solu
Using the Google Cloud Dataproc WorkflowTemplates API to Automate Spark and Hadoop Workloads on GCP
Introduction
In the previous post, Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service, we explored Google Cloud Dataproc using the Google Cloud Console as well as the Google Cloud SDK and Cloud Dataproc API. We created clusters, then uploaded and ran Spark and PySpark jobs, then deleted clusters, each as discrete tasks. Although each…
Big Data Analytics with Java and Python, using Cloud Dataproc, Google’s Fully-Managed Spark and Hadoop Service
Introduction
There is little question that big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last few years. Behind the hype curves and marketing buzz, these technologies are having a significant influence on all aspects of our modern lives.
However, installing, configuring,…