Cookie Consent by Free Privacy Policy Generator Aktuallisiere deine Cookie Einstellungen ๐Ÿ“Œ Clustering in BigQuery (Dynamic Filters & External Queries)


๐Ÿ“š Clustering in BigQuery (Dynamic Filters & External Queries)


๐Ÿ’ก Newskategorie: Programmierung
๐Ÿ”— Quelle: dev.to

BigQuery is not like the other SQL languages in that one key thing does not exist... indexes!

Instead we have clustering to work with to improve not just peformance but cost.

BigQuery's pricing model is based on querying costs, and those querying costs are proportional to how much data is read in each query. As such a standard SELECT * FROM dataset.table can start to be very expensive when that table is reaching towards 100s of GBs. Using clustering to reduce the amount of data being read is essential.

Clustering

I'm not going to re-explain what BigQuery clustering is as I believe the docs do it quite well. For a very basic use-case, that is.

When suddenly you have a requirement that is filtering a clustered column on say, subquery results, it all falls apart. I do hope this is something Google fixes, or at least creates more clear documentation on because one of my main findings is that clustering is very inconsistent.

Execution Plan

When we came across a single query that cost $30 (and needed to be run regularly!), I soon scrambled for a way to find out why exactly my supposed clustering optimisation was not working.

Take a look at these two similar but oh so different execution stages:

$1:Column1, $2.ClusteredColumn, $3.Column2
FROM project.dataset.clusteredTable
WHERE in_array($2, ARRAY<...>)
$1:Column1, $2.ClusteredColumn, $3.Column2, $4.FilterColumn
FROM project.dataset.clusteredTable
SEMI HASH JOIN project.dataset.filterTable EACH WITH ALL ON $4 = $2

Image description

Although seemingly performing the same job, there is some mysterious magic in BigQuery's back-end which means clustering is only used in the first case. Therefore the new goal is to write queries that consistently result in the first execution plan stage.

Dynamic Queries

If you are wanting to take advantage of clustering by dynamic results in a single query the unfortunate answer is, I don't think it's possible. To result in the WHERE clause in the execution plan, the clustered column must be filtered on a static value.

Luckily, we have multi-statement queries to save the day!

My specific use-case with BigQuery is having a lot of views to external query connections and a few large base tables. This is my company's current hybrid setup after a previously fully MySQL solution to get the best of both worlds out of MySQL's transactional nature and BigQuery's big data storage. Therefore, a lot of our queries depend on first filtering results of these external views before joining to our base tables.

Here's how I ensured that clustering is always used in the execution plan.

Non-Optimised for Clustering:

SELECT column1, column2
FROM dataset.clusteredTable c
JOIN dataset.filteredTable f ON c.clusteredcolumn = f.joinColumn
WHERE f.filterColumn = 'Some Value'

Optimised for Clustering:

DECLARE filters_variableARRAY<STRING> DEFAULT (
    SELECT ARRAY_AGG(joinColumn)
    FROM dataset.filteredTable
    WHERE filterColumn = 'Some Value'
);
SELECT column1, column2
FROM dataset.clusteredTable c
WHERE c.clusteredcolumn IN UNNEST(filters_variable)

Although not the prettiest query, I found using declared variables as in the optimised version to be the only way to consistently ensure that query costs were low. Using CTEs or subqueries did not have the same affect.

Limitations

Of course, nothing is perfect and there are some limitations with using DECLARE to create array variables each time you need to filter on our clustered column.

  1. Unreadable queries. Especially when you start to need multiple variables, things can soon get out of hand
  2. Size limitations. If your filter array has too many values than you can hit a BigQuery limit.

Script expression exceeded evaluation limit of 1048576 bytes.

Ultimately, BigQuery was designed for denormalised data where ideally the values you need to filter by are within the table itself which is exactly what clustering was designed for. However, if like us you have migrated a relational data structure into BigQuery, things may not always be as simple without entirely re-designing all your schemas.

Using this hack to ensure clustering is consistently applied to queries can be a good cost-saving measure to force BigQuery to play nice with normalised data. :)

...



๐Ÿ“Œ Clustering in BigQuery (Dynamic Filters & External Queries)


๐Ÿ“ˆ 85.73 Punkte

๐Ÿ“Œ Top Three Clustering Algorithms You Should Know Instead of K-means Clustering


๐Ÿ“ˆ 37.33 Punkte

๐Ÿ“Œ Automatically Updating a BigQuery Table Using an External API and a Cloud Function


๐Ÿ“ˆ 28.22 Punkte

๐Ÿ“Œ Can someone suggest me some nodejs packages or workarounds that can convert natural language queries into mongodb queries.


๐Ÿ“ˆ 27.58 Punkte

๐Ÿ“Œ Enhancing Data Queries with Firebase: Using Compound Queries to Filter Results


๐Ÿ“ˆ 27.58 Punkte

๐Ÿ“Œ Elasticsearch Tutorial: A Deep Dive into Filters and Compound Queries


๐Ÿ“ˆ 26.56 Punkte

๐Ÿ“Œ [local] Andrea ST Filters Service 1.0.64.7 - 'Andrea ST Filters Service ' Unquoted Service Path


๐Ÿ“ˆ 25.54 Punkte

๐Ÿ“Œ #0daytoday #Andrea ST Filters Service 1.0.64.7 - (Andrea ST Filters Service) Unquoted Service Path [#0day #Exploit]


๐Ÿ“ˆ 25.54 Punkte

๐Ÿ“Œ CVE-2020-11812 | Rukovoditel 2.5.2 filters[0][value]/filters[1][value] sql injection


๐Ÿ“ˆ 25.54 Punkte

๐Ÿ“Œ t3n Daily: Adobe &amp;amp; Figma, Ethereum &amp;amp; NFT, Steuer &amp;amp; Homeoffice, KI &amp;amp; Gruselfrau


๐Ÿ“ˆ 25.19 Punkte

๐Ÿ“Œ Writing dynamic Ecto queries with Composite


๐Ÿ“ˆ 23.97 Punkte

๐Ÿ“Œ Flutter web support updates, BigQuery materialized views, Cloud Spanner emulator, & more!


๐Ÿ“ˆ 21.8 Punkte

๐Ÿ“Œ MySQL to BigQuery Migration (Syntax & Other Conversions)


๐Ÿ“ˆ 21.8 Punkte

๐Ÿ“Œ Introduction to the ELF Format (Part VII): Dynamic Linking / Loading and the .dynamic section


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Medium CVE-2017-18604: Sitebuilder dynamic components project Sitebuilder dynamic components


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Dynamic Cow: App bringt Dynamic Island auf alte iPhones


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ CVE-2023-31032 | NVIDIA DGX A100 prior 1.25 dynamic dynamic variable evaluation


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Netlify Dynamic Site Challenge Submission: Dynamic Image Gallery with Netlify Image CDN Visual Feast


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Letโ€™s Get Dynamic! Ideas for the Netlify Dynamic Site Challenge


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Netlify Dynamic Site Challenge : Building a Dynamic Image Gallery with Netlify Image CDN


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Netlify Dynamic Site Challenge Submission: Dynamic Image Gallery with Netlify Image CDN Visual Feast


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Netlify Dynamic Site Challenge : Building a Dynamic Image Gallery


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Dynamic Report Generation in Laravel: Introducing `laravel-dynamic-report-generator`


๐Ÿ“ˆ 20.36 Punkte

๐Ÿ“Œ Searching for Bitcoins in GitHub repositories with Google BigQuery


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ BigQuery Demo - The State of the Web


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ BigQuery - The State of the Web


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Android Studio 3.5, Cloud Run Button, BigQuery Terraform module


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Feature engineering in BigQuery and TensorFlow 2.0/Keras - Kirkland ML Summit โ€˜19


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Mastering the Chrome UX Report on BigQuery


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Proactive Hunting with Certificate Transparency Log and Google BigQuery


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Machine Learning: Google erweitert BigQuery ML um drei nichtlineare Modelltypen


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ How to build and deploy a demand forecasting solution with BigQuery ML


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Sleep API, GKE Autopilot, BigQuery BI Engine, and more!


๐Ÿ“ˆ 19.71 Punkte

๐Ÿ“Œ Measure and debug performance with Google Analytics 4 and BigQuery


๐Ÿ“ˆ 19.71 Punkte











matomo