A step-by-step guide for what to do when you’ve got more data than one computer can handle.
At some point in your nascent data science career, you’re going to encounter a dataset that is computationally just too much for your individual machine to process. Or maybe you’re working on a collaborative project and have to give different users read/write access to the same dataset. Either way, once you’ve outgrown the Jupyter Notebook and are ready to start building real storage solutions, there’s no better place than the cloud.
Getting Started with AWS
Regardless of which database engine you choose, you will likely need to launch an instance of Amazon Relational Database Service (RDS).
From the home page of the AWS Management Console, you’ll scroll down to the Database section and click the first option “RDS”. Alternatively, you can search for “RDS” in the search bar at the top of the AWS Management Console.
Clicking that link will bring you to the Amazon RDS Dashboard, where you’ll scroll down to about mid-page to a dialog box that says “Create database.”
Now, most settings are pre-configured to industry best practices, but we’ll want to fine-tune them for our purposes. First, you’ll have to decide which RDBMS (Relational Database Management System) you’d like to use to manage your data. PostgreSQL, MySQL, and MariaDB are all open-source systems with free-tier-eligible storage. Oracle and Microsoft SQL Server are commercial platforms, so don’t worry about those here.
Then, under Templates, you’ll go ahead and select the “Free Tier” option. The Amazon RDS free-tier includes 20 GB of General Purpose (SSD) storage for up to 12 months. For more information on AWS pricing scales, feel free to check out their website.
Next comes the most important part of this entire configuration: the DB instance identifier and the username/password fields. The master username in the “Settings” box cannot be changed after a database instance is launched, so choose carefully. (The password can be reset later, but save yourself the trouble.)
The DB instance identifier is different from the database name itself, so choose something unique. The credentials underneath it are the master username and password (the admin account) for this instance; this login can later be used to grant other users access to the database. BE SURE TO SAVE THIS INFORMATION! You shouldn’t have to change many other settings on this page, with the exception of Connectivity. I’d recommend giving your DB instance public access, to make it easier to connect from pgAdmin 4.
Finally, I’d also recommend using the default VPC security group. For that, you click “Choose existing” security group, and then select “Default” in the drop-down list.
That’s it! Once you’re done, scroll all the way to the bottom of the page and click the orange button that says “Create Database.” This will take you back to the Databases Dashboard where you can see all the information about your new DB instance. It will take several minutes to process, and your DB instance isn’t ready until the Status column changes from “Creating” to “Available.”
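If you prefer the command line, the same steps can be scripted with the AWS CLI (assuming it is installed and configured with your credentials). This is only a sketch: the instance identifier, username, and password below are placeholders, so substitute your own.

```shell
# Create a free-tier-eligible PostgreSQL instance
# (identifier, username, and password are placeholders)
aws rds create-db-instance \
    --db-instance-identifier my-first-db \
    --engine postgres \
    --db-instance-class db.t3.micro \
    --allocated-storage 20 \
    --master-username adminuser \
    --master-user-password 'choose-a-strong-password' \
    --publicly-accessible

# Block until the status flips from "creating" to "available"
aws rds wait db-instance-available --db-instance-identifier my-first-db

# Print the endpoint URL you'll need later for pgAdmin 4
aws rds describe-db-instances \
    --db-instance-identifier my-first-db \
    --query 'DBInstances[0].Endpoint.Address' --output text
```

The `wait` command simply polls for you, so you don’t have to refresh the Databases Dashboard by hand.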
Congratulations! You just created your first database in the cloud!!
Connecting to PostgreSQL
After taking a few minutes to celebrate all your hard work, we’ll need to configure a few security settings so that pgAdmin 4 can actually reach the DB instance.
Click on the DB instance, and navigate to the Connectivity & security tab. Notice the endpoint URL. SAVE THIS FOR LATER. Port 5432 is the PostgreSQL default, but copy that as well if you made any changes. Next, scroll to the far right of this tab to the active VPC security group, and click on the link that says “default.”
This will bring you to the EC2 Management Console. Scroll down and click on the “Inbound rules” tab. Then, click on the button on the right of the screen that says “Edit inbound rules.”
From here, you’ll have to manually change a few of the security rules. First, click the Type dropdown and select “All traffic”, then click the Source dropdown and select “My IP”. Then, add a second rule and repeat this process with Type “PostgreSQL” and Source “Anywhere.” A word of caution: a Source of “Anywhere” (0.0.0.0/0) lets any IP on the internet attempt to connect to port 5432. That’s fine for a tutorial, but for anything sensitive you should restrict the Source to known IP addresses.
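The same inbound rule can be added from the AWS CLI. The security-group ID and CIDR below are placeholders (find your group ID in the EC2 console); the console’s “Anywhere” option corresponds to the CIDR 0.0.0.0/0, so a /32 of your own address is the safer choice.

```shell
# Placeholder security-group ID -- substitute your default VPC group's ID
SG_ID=sg-0123456789abcdef0

# Allow PostgreSQL (TCP 5432) from one specific address
# (replace 203.0.113.7/32 with your own IP; 0.0.0.0/0 would open it to everyone)
aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" \
    --protocol tcp \
    --port 5432 \
    --cidr 203.0.113.7/32
```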
Ok, we should be ready to go! If you don’t already have pgAdmin 4 installed, go to the pgAdmin website and follow the installation instructions. After opening the program, it might ask you to create a master password. SAVE THAT SOMEWHERE. Next, you’ll click the “Add New Server” button right on the dashboard.
This will bring up a Create Server window. Feel free to enter whatever name you wish on the General tab (this name is local to pgAdmin and has nothing to do with AWS). Now, click on the Connection tab. This is where you’ll enter all of the credentials needed to access your DB instance.
Remember that endpoint URL from earlier? Paste it into the Host name/address field. Similarly, the Port should match whatever you set in your Amazon RDS configuration. The username and password are the master credentials you created along with the DB instance (or any other database role you’ve since granted access). Double-check everything, then click Save. You’ll know it’s working when the pgAdmin dashboard starts graphing transactions on the server.
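Those same pieces of information (endpoint, port, database, username, password) also make up a standard PostgreSQL connection URI, which many tools accept in place of separate fields. Here’s a small sketch that assembles one; the endpoint, username, and password are placeholders, and the helper percent-encodes the password in case it contains awkward characters:

```python
from urllib.parse import quote

def build_pg_uri(host, port, dbname, user, password):
    """Return a libpq-style connection URI; the password is
    percent-encoded in case it contains characters like @ or /."""
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{dbname}"

uri = build_pg_uri(
    host="my-first-db.abc123xyz.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    port=5432,
    dbname="postgres",   # the default maintenance database on a fresh instance
    user="adminuser",    # placeholder master username from the Settings box
    password="p@ss/word",
)
print(uri)
# postgresql://adminuser:p%40ss%2Fword@my-first-db.abc123xyz.us-east-1.rds.amazonaws.com:5432/postgres
```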
Alright!! Way to go! We’re almost done.
Unfortunately, your data sits at the bottom of a fairly deep tree in the left-hand browser. You have to expand the drop-down arrows, starting with your server:
*pgAdmin server name* > Databases > *AWS database name* > Schemas > public > Tables
Right-click the Tables icon and select Query tool. This is where we’ll input the SQL commands to upload our data.
You should see the Query Editor window open to the right of the browser tree. Here, you’ll copy and paste the SQL code that creates the schema for your first table. Note: the column names should match the header row in your .csv file exactly, and you’ll also have to specify a datatype for each column.
Common PostgreSQL datatypes include: integer, bigint, numeric, double precision, money, boolean, and text.
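If you’re unsure which type fits a column, a quick pass over the raw string values can narrow it down. Here’s a rough helper covering a few of the types above; it’s a convenience sketch, not a substitute for knowing your data (for instance, it won’t recognize money or dates):

```python
def guess_pg_type(values):
    """Guess a common PostgreSQL type for a list of string values from a .csv column."""
    def all_match(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in ("true", "false", "t", "f") for v in values):
        return "boolean"
    if all_match(int):
        # bigint covers values beyond the 32-bit integer range
        return "integer" if all(-2**31 <= int(v) < 2**31 for v in values) else "bigint"
    if all_match(float):
        return "double precision"
    return "text"

print(guess_pg_type(["1", "2", "3"]))       # integer
print(guess_pg_type(["3000000000"]))        # bigint
print(guess_pg_type(["1.5", "2.0"]))        # double precision
print(guess_pg_type(["hello", "world"]))    # text
```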
For example (the column names here are placeholders; use the ones from your own .csv header):

CREATE TABLE public.backers (
    id integer,
    name text,
    pledged numeric
);
Then click the “Execute/Run” button on the top toolbar. You should get a green pop-up notification in the bottom right-hand corner of the screen that says “Query returned successfully.”
Navigate back to the folder structure on the left-hand side of your screen. Right-click on Tables and select “Refresh.” You should now see your newly created Table in the folder structure. Right-click that new Table and select “Import/Export.” This will bring up a new Import/Export window.
First, click the Import/Export slider at the top of this window. It should turn green and say “Import.” Then, use the file dropdown to select a .csv from your local computer; the filepath will load in the corresponding “Filename” field. Click the slider for “Header” if the .csv contains a row of column names (most do). Select a comma as the delimiter (or another character, depending on your data). Finally, check the quote character (usually a double quote), which is what keeps non-delimiting commas intact: a quoted value like “Last Name, First Name” will not get split into two columns. (The separate escape character handles quote marks that appear inside a quoted value.)
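You can see that quoting behavior with Python’s built-in csv module, which follows the same convention: a comma inside a double-quoted field does not split the field.

```python
import csv
import io

# A row where one field contains a non-delimiting comma, wrapped in double quotes
raw = 'id,name\n1,"Last Name, First Name"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['1', 'Last Name, First Name'] -- the quoted comma did not split the field
```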
Then click OK! If everything is working correctly, you should get another pop-up notification with details on your AWS DB instance and a green bar that says “Successfully completed.” If you run into any errors, double-check the original .csv file at the row specified for missing commas or problems with quoting and escape characters.
That’s it! All of your data is now stored safely in the cloud for you and your team members to query away.
You can repeat the CREATE TABLE and .csv import steps for as many files as you would like to upload. Just be sure to continually monitor your AWS usage and logs so you don’t accidentally exceed the free-tier limits. Best of luck out there!