Ausnahme gefangen: SSL certificate problem: certificate is not yet valid 📌 Pandas Isn’t Enough. Learn These 25 Pandas to SQL Translations To Upgrade Your Data Analysis Game

🏠 Team IT Security News

TSecurity.de ist eine Online-Plattform, die sich auf die Bereitstellung von Informationen,alle 15 Minuten neuste Nachrichten, Bildungsressourcen und Dienstleistungen rund um das Thema IT-Sicherheit spezialisiert hat.
Ob es sich um aktuelle Nachrichten, Fachartikel, Blogbeiträge, Webinare, Tutorials, oder Tipps & Tricks handelt, TSecurity.de bietet seinen Nutzern einen umfassenden Überblick über die wichtigsten Aspekte der IT-Sicherheit in einer sich ständig verändernden digitalen Welt.

16.12.2023 - TIP: Wer den Cookie Consent Banner akzeptiert, kann z.B. von Englisch nach Deutsch übersetzen, erst Englisch auswählen dann wieder Deutsch!

Google Android Playstore Download Button für Team IT Security



📚 Pandas Isn’t Enough. Learn These 25 Pandas to SQL Translations To Upgrade Your Data Analysis Game


💡 Newskategorie: AI Nachrichten
🔗 Quelle: towardsdatascience.com

25 common SQL Queries and their corresponding methods in Pandas.

Photo by James Yarema on Unsplash
This is my 50th article on Medium. Thank you so much for reading and appreciating my work 😊! It’s been an absolutely rewarding journey.
If you like reading my articles here on Medium, I am sure you will love this as well: The Daily Dose of Data Science.
What is this? It’s a data-science oriented publication that I run on substack.
What will you get from this? Here I present elegant and handy tips and tricks around Data-science/Python/Machine Learning, etc., one tip a day (See publication archive here). If you are interested, you can subscribe to receive the daily doses right in your inbox. And it’s completely free. Would love to see on the other side!

Motivation

SQL and Pandas are both powerful tools for data scientists to work with data.

SQL, as we all know, is a language used to manage and manipulate data in databases. On the other hand, Pandas is a data manipulation and analysis library in Python.

Moreover, SQL is often used to extract data from databases and prepare it for analysis in Python, mostly using Pandas, which provides a wide range of tools and functions for working with tabular data, including data manipulation, analysis, and visualization.

Together, SQL and Pandas can be used to clean, transform, and analyze large datasets, and to create complex data pipelines and models. Therefore, proficiency in both frameworks can be extremely valuable to data scientists.

Therefore, in this blog, I will provide a quick guide to translating the most common Pandas operations to their equivalent SQL queries.

Let’s begin 🚀!

Dataset

For demonstration purposes, I created a dummy dataset using Faker:

Random Employee Dataset (Image by author)

#1 Reading a CSV file

Pandas

CSVs are typically the most prevalent file format to read Pandas DataFrames from. This is done using the pd.read_csv() method in Pandas.

https://medium.com/media/de7bff6f3bd7587f0803cb58fceb7532/href

SQL

To create a table in your database, the first step is to create an empty table and define its schema.

https://medium.com/media/fae58e546b0d1ad4f92012d9a9bd1d0d/href

The next step is to dump the contents of the CSV file (starting from the second row if the first row is the header) into the table created above.

https://medium.com/media/2535dee45ca597ffe93472cdb2254d2e/href

Output

We get the following output after creating a DataFrame/Table:

Output after reading the CSV (Image by Author)

#2 Displaying the First 5 (or k) Rows

Pandas

We can use the df.head() method in Pandas.

https://medium.com/media/8860306967735a5a9b426df80eacebb0/href

SQL

In MySQL Syntax, we can use limit after select and specify the number of records we want to display.

https://medium.com/media/5204ff33f0ba34c0e36bde122c7a2d68/href

#3 Printing the Dimensions

Pandas

The shape attribute of a DataFrame object prints the number of rows and columns.

https://medium.com/media/85d33a47b627002da11319f3f6d81612/href

SQL

We can use the count keyword to print the number of rows.

https://medium.com/media/7eda1d885ee5798538d15056dc84d96e/href

#4 Printing the Datatype

Pandas

You can print the datatype of all columns using the dtypes argument:

https://medium.com/media/5d98812fda2550c08171a4d90567e881/href

SQL

Here, you can print the datatypes as follows:

https://medium.com/media/9194326111b031bae20acfe1c8651a78/href

#5 Modifying the Datatype of a column

Pandas

Here, we can use the astype() method as follows:

https://medium.com/media/9e825477211c3eeb08af5838f3395c9b/href

SQL

Use ALTER COLUMN to change the datatype of the column.

https://medium.com/media/bbc0ba8facd54b29e5ec4aa8724be798/href

The above will permanently modify the datatype of the column in the table. However, if you just wish to do that while filtering, use cast.

https://medium.com/media/77b98a7e509d20e4904fc82713551440/href

#6–11 Filtering the Data

There are various ways to filter dataframe in Pandas.

#6: You can filter on one column as follows:

https://medium.com/media/0369a934be7a15bd3ed6c72c5363ddb9/href

The above can be translated to SQL as follows:

https://medium.com/media/88b2deb92dd98ddc416b406a625027ea/href

#7: Furthermore, you can filter on multiple columns as well:

https://medium.com/media/9d2c756ddf0e71ba7738094bffbf918c/href

The SQL equivalent of the above filtering is:

https://medium.com/media/7f92843494e0c61452a66832d292b8b3/href

#8: You can also filter from a list of values using isin():

https://medium.com/media/33bb5bfed1237e61b63477f1f3d7de2c/href

To mimic the above, we have in keyword in SQL:

https://medium.com/media/89050bf6b4c005616c7d0c50c1c1ece1/href

#9: In Pandas, you can also select a particular column using the dot operator.

https://medium.com/media/6fa7c3ca685cb42344f33e0f21291c55/href

In SQL, we can specify the required column after select.

https://medium.com/media/7032e006d8e4921df961ed34bb23c155/href

#10: If you want to select multiple columns in Pandas, you can do the following:

https://medium.com/media/a4b31b3ef0a8ad81ff5074f5bcda2761/href

The same can be done by specifying multiple columns after select in SQL.

https://medium.com/media/9d82196e67bcf1a4c102b23c868c83bb/href

#11 You can also filter based on NaN values in Pandas.

https://medium.com/media/dfc3ecbded294d894eec4bd9bc30844a/href

The same can be extended to SQL as follows:

https://medium.com/media/22935983fd98f86469c39f40ee870a5f/href

#12 We can also perform some complex pattern-based string filtering.

https://medium.com/media/a1443ca66a9c2ab16604e9b2ba855700/href

In SQL, we can use the LIKE clause.

https://medium.com/media/ba4a5dcbef3a71cb07dc055cbe4cd54e/href

#13 You can also search for a substring within a string. For instance, say we want to find all the records in which last_name contains the substring “an”.

In Pandas, we can do the following:

https://medium.com/media/f1eab88cc0bdc10a3d29e5b714670d76/href

In SQL, we can again use the LIKE clause.

https://medium.com/media/171b2d9613fb92d31b8c3ac503d0d807/href

#14–16 Sorting Data

Sorting is another typical operation that Data Scientists use to order their data.

Pandas

Use the df.sort_values() method to sort a DataFrame.

https://medium.com/media/1e68e157e74787fbf5714490d1ba7bd3/href

You can also sort on multiple columns:

https://medium.com/media/1fb78fae76e90afd16f1469b21c0ddd5/href

Lastly, we can specify different criteria (ascending/descending) for different columns too using the ascending parameter.

https://medium.com/media/5f50304d47bdf1422dd350fd1d177e4b/href

Here, the list corresponding to ascending indicates that last_name is sorted in descending order and level in ascending order.

SQL

In SQL, we can use order by clause to do so.

https://medium.com/media/f359e5f1631c150b0a466cee92f78b9b/href

Furthermore, by specifying more columns in the order by clause, we can include more columns for sorting criteria:

https://medium.com/media/3ef075f5a369478b8ddf9fe414ccc9f7/href

We can specify different sorting orders for different columns as follows:

https://medium.com/media/d0e9d214384081a2e6e482717a9d519e/href

#17 Fill NaN values

For this one, I have intentionally removed a couple of values in the salary column. This is the updated DataFrame:
https://medium.com/media/4e9b3925c970b110eb96a7dd08b04ed5/href

Pandas

In Pandas, we can use the fillna() method to fill NaN values:

https://medium.com/media/12fa74b6e637a741fa7c9fb9c60484ed/href

SQL

In SQL, however, we can do so using the case statement.

https://medium.com/media/0ee0e604ec3b9b65240bf0987a5e5b03/href

#18–19 Joining Data

Pandas

If you want to merge two DataFrames with a joining key, use the pd.merge() method:

https://medium.com/media/708b3eb18008ff1056eb2ee4e1590109/hrefhttps://medium.com/media/c62a83025434a48e434ab7c797f60c00/href

SQL

https://medium.com/media/05e1b4f5d1626f651a2718ed32332454/href

Another way to join datasets is by concatenating them.

Pandas

Consider the DataFrame below:

https://medium.com/media/696fe4846558e2622be7ae1f70427797/href

In Pandas, you can use the concat() method and pass the DataFrame objects to concatenate as a list/tuple.

https://medium.com/media/dbc16d1716cee45b90c8bf62a806d12e/href

SQL

The same can be achieved with UNION (to keep only unique rows) and UNION ALL (to keep all rows) in SQL.

https://medium.com/media/bc713d067e9f7af0ceb59f9de98f0f46/href

#20 Grouping Data

Pandas

To group a DataFrame and perform aggregations, use the groupby() method in Pandas, as shown below

https://medium.com/media/4004f245662b376c206c2636fa2e4ede/href

SQL

In SQL, you can use the group by clause and specify aggregations in the select clause.

https://medium.com/media/ed9bcbbfd001503c6f9748dcd720c3d0/href

And we do see the same outputs!

#21–22 Finding Unique Values

Pandas

To print the distinct values in a column, we can use the unique() method.

https://medium.com/media/df37114cb62e0ec3ec8a611af32865ee/href

To print the number of distinct values, use the nunique() method.

https://medium.com/media/b66bc4bc4af7a01b2380e4897ed57d39/href

SQL

In SQL, we can use the DISTINCT keyword in select as follows:

https://medium.com/media/6d27babcf9bf73e43fe6f96bc9671d8c/href

To count the number of distinct values in SQL, we can wrap the COUNT aggregator around distinct.

https://medium.com/media/4c074a8d5c3df3ca661f887442351249/href

#23 Renaming Column

Pandas

Here, use the df.rename() method, as demonstrated below:

https://medium.com/media/02a5b529136215ac56bae1090b09bfe9/href

SQL

We can use ALTER TABLE to rename a column:

https://medium.com/media/fc36fe594f1f230f9a184bd15129c358/href

#24 Deleting Column

Pandas

Use the df.drop() method:

https://medium.com/media/8896b041aabcc0b04e478b1284439e15/href

SQL

Similar to renaming, we can use ALTER TABLE and change RENAME to DROP.

https://medium.com/media/dabbafc7e226ff7c02e72914c71c0602/href

#25 Creating a New Column

Say we want to create a new column full_name, which is the concatenation of columns first_name and last_name, with a space in between.

Pandas

We can use a simple assignment operator in Pandas.

https://medium.com/media/0c06bf28ed6824a9b315597b77554d87/href

SQL

In SQL, the first step is to add a new column:

https://medium.com/media/7f1b629b378318cb5e7d42c4957cb9dc/href

Next, we set the value using SET in SQL.

https://medium.com/media/32f1cf3156ff80f10095eae6b733763f/href

Conclusion

Congratulations! You now know the SQL translation of the most common methods in Pandas.

I have tried to cover translations for most of the data scientists use on a regular basis in Pandas. However, I understand I might have missed a few.

Do let me know in the responses.

As always, thanks for reading!

Image Generated and Edited by Author using Stable Diffusion.

🚀 Subscribe to the Daily Dose of Data Science. Here, I share elegant tips and tricks on Data Science, one tip a day. Receive these tips right in your inbox daily.

🧑‍💻 Become a Data Science PRO! Get the FREE Data Science Mastery Toolkit with 450+ Pandas, NumPy, and SQL questions.

Get a weekly summary of the top 1% Research papers, News, Open Source Repos, and Tweets in Machine Learning.


Pandas Isn’t Enough. Learn These 25 Pandas to SQL Translations To Upgrade Your Data Analysis Game was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

...



📌 Pandas Isn’t Enough. Learn These 25 Pandas to SQL Translations To Upgrade Your Data Analysis Game


📈 120.17 Punkte

📌 Worrying about data privacy isn’t enough: Here’s how to own your online presence


📈 30.39 Punkte

📌 Pandas - Visualizing Dataframe Data - 7 Days of Pandas


📈 30.24 Punkte

📌 Level Up Your Pandas Game with These 15 Hidden Gems


📈 30.06 Punkte

📌 Mozilla Community Call - Learn more about Firefox Translations!


📈 29.53 Punkte

📌 Pandas & Python Tricks for Data Science & Data Analysis — Part 2


📈 28.69 Punkte

📌 Pandas & Python Tricks for Data Science & Data Analysis — Part 3


📈 28.69 Punkte

📌 Pandas & Python Tricks for Data Science & Data Analysis — Part 4


📈 28.69 Punkte

📌 Pandas and Python Tips and Tricks for Data Science and Data Analysis


📈 28.69 Punkte

📌 Pandas & Python Tricks for Data Science & Data Analysis — Part 5


📈 28.69 Punkte

📌 Firefox Translations and Other Innovations – These Weeks in Firefox: Issue 139


📈 28.68 Punkte

📌 When Your Day Job Isn't Enough


📈 27.04 Punkte

📌 When one isn’t enough: This shady malware will infect your PC with dual Trojans


📈 27.04 Punkte

📌 WFH is the new religion – but blind faith isn’t enough to keep your infrastructure secure


📈 27.04 Punkte

📌 Why Your Tech Stack Isn't Enough To Ensure Cyber Resilience


📈 27.04 Punkte

📌 [Python's Pandas] The Future Of Pandas


📈 26.89 Punkte

📌 No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency


📈 26.89 Punkte

📌 Pandas for Fun and Profit: Using Pandas for Successful Investing


📈 26.89 Punkte

📌 Pandas in One Hour (Introduction to Pandas Library)


📈 26.89 Punkte

📌 How to Rename a Column in Pandas – Python Pandas Dataframe Renaming Tutorial


📈 26.89 Punkte

📌 pandas.DataFrame.sort_values - How To Sort Values in Pandas


📈 26.89 Punkte

📌 Pandas round() Method – How To Round a Float in Pandas


📈 26.89 Punkte

📌 Python vs Pandas - Difference Between Python and Pandas


📈 26.89 Punkte

📌 Need for Speed: cuDF Pandas vs. Pandas


📈 26.89 Punkte

📌 Measuring The Speed of New Pandas 2.0 Against Polars and Datatable — Still Not Good Enough


📈 25.86 Punkte

📌 Best Malware Analysis Tools | Learn Malware Analysis


📈 25.79 Punkte

📌 4 Faster Pandas Alternatives for Data Analysis


📈 25.34 Punkte

📌 Introduction to Pandas and NumPy for Data Analysis


📈 25.34 Punkte

📌 DATA ANALYSIS WITH PANDAS AND PYTHON


📈 25.34 Punkte

📌 Your server remote login isn't root:password, right? Cool. You can keep your data. Oh sh... your IoT gear, though?


📈 24.97 Punkte

📌 Enough is not enough - Unternehmen wachsen mit dem Zero Outage Industry Standard über sich ...


📈 24.82 Punkte

📌 Poll: Facebook Harvests Email Contacts for 1.5M Users – Is Enough, Enough?


📈 24.82 Punkte











matomo