Optimizing AWS Costs for Big Data Workloads

Many moons ago, when we had very little experience with AWS technologies and no clue about the scale and volume to which the business might grow, we over-provisioned some of our AWS resources just to be on the safe side. We at Finbox made this mistake, and when our AWS bills started skyrocketing, we narrowed down multiple areas in AWS that could have been optimized. In this blog, we will discuss Redshift and RDS. In subsequent blogs, we will discuss how you can optimize API Gateway, Kinesis Streams, EC2 instances and S3.

Redshift

Redshift is AWS's petabyte-scale, fully managed columnar database offering. Companies store their hot data in it for speed and its SQL-like query syntax.

  1. Compression

15%+ of our total infrastructure cost goes to Redshift. If you store a table in Redshift without proper encoding of the columns, the table will probably not get compressed and will end up taking huge space. Since Redshift pricing is based on the number of nodes, your bill is essentially proportional to your storage space. AWS provides a very useful feature called deep copy, which you can use to compress your existing uncompressed columns, though it comes with its own limitations. If you are not sure which compression to apply to which columns, Redshift provides a command:

analyze compression <table_name>;

This is a time-consuming step and might take a while to produce the expected encodings.

It will output something like:

 Column     | Encoding
------------+-----------
 A          | RAW
 B          | mostly16
 C          | bytedict
 D          | lzo

Please note that you should never encode your sort key in Redshift; it expects the sort key to be in raw format for indexing and scanning. Having said that, once you know the type of encoding you should apply to each column, create a new table with the schema updated for the expected encodings. You can get the schema of the current table using:

pg_dump -h <host> -U <user> -d <database> -t <table> --schema-only -p 5439;

The output would be something like:

--
-- PostgreSQL database dump
--
-- Dumped from database version 8.0.2
-- Dumped by pg_dump version 9.6.10

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
SET row_security = off;
SET default_tablespace = '';
SET default_with_oids = true;

--
-- Name: TABLENAME; Type: TABLE; Schema: public; Owner: USERNAME
--

CREATE TABLE public.TABLENAME (
    id bigint DEFAULT "identity"(334987, 0, '1,1'::text),
    batch_id character varying(50),
    start_time timestamp without time zone,
    duration integer,
    type character varying(10),
    created_at timestamp without time zone
);

ALTER TABLE public.TABLENAME OWNER TO USERNAME;

--
-- PostgreSQL database dump complete
--


Create a new table using the output of the above command, with the updated column encodings. Now copy everything from the original table into it, do some sanity testing, then drop the old table and rename the new one. This simple activity can free up 30-40% of disk space, which will directly get reflected in your savings. It can be challenging for use cases where data is continuously being ingested into Redshift. We use Firehose to load data into Redshift. Firehose provides a retry functionality, wherein if injection of the incoming data points fails, it keeps retrying for a maximum of 2 hours (you can find this option in the Firehose configuration settings). This helps us keep the deep-copy activity smooth and hassle-free.
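
A minimal sketch of the deep-copy flow, assuming a hypothetical table public.events whose recommended encodings came out of the analyze compression step above (table and column names are placeholders, not our actual schema):

-- 1. Create a new table with the recommended encodings; the sort key stays raw
create table public.events_compressed (
    id bigint encode raw,
    batch_id character varying(50) encode lzo,
    start_time timestamp encode lzo,
    duration integer encode mostly16,
    type character varying(10) encode bytedict,
    created_at timestamp encode lzo
)
sortkey (id);

-- 2. Copy everything from the old table into the new one
insert into public.events_compressed select * from public.events;

-- 3. After sanity checks, swap the tables
drop table public.events;
alter table public.events_compressed rename to events;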

  2. Daily ETL

We used Redshift as our data warehouse cum data lake. This was one of the biggest mistakes in our server configuration. All of us have heard about Redshift Spectrum, but we never actually bothered to dig deep into this Amazon offering. It turned out to be a game changer for us in terms of saving money. Spectrum provides a very nice abstraction over your structured and semi-structured data stored in S3, which you can query from Redshift itself and join with tables stored in it. This helped us bring down our x node cluster to x/4 nodes.

External tables can be created only in an external schema, which virtually refers to an external database. To create an external schema for Spectrum, use:

create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/SpectrumRole'
create external database if not exists;

External tables can be created using:

create external table spectrum.<table_name> (
    column_a integer,
    column_b integer,
    column_c integer,
    column_d integer,
    column_e smallint,
    column_f decimal(8,2),
    column_g timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://<bucket>/a/b/c/d'
table properties ('numRows'='36000');

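Once the external table exists, you can query it from Redshift and join it with local tables. A small illustrative query, using hypothetical tables spectrum.raw_events and public.batches:

-- Hypothetical ad-hoc query joining S3-backed data with a local Redshift table
select b.batch_id, count(*) as event_count
from spectrum.raw_events e
join public.batches b on e.batch_id = b.batch_id
group by b.batch_id;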

Now you can simply use the Redshift unload and copy commands to load data into this external table. Spectrum billing is based on the amount of data that is scanned, and since we rarely query our raw data once it has been featurized, this helps us save some bucks.

Redshift comes with two types of nodes: dense compute and dense storage. You can either optimize your cluster for queries, or have more storage but compromise on speed. But there is a small catch here: for the latter, Redshift forces you onto a cluster with a minimum size of 2TB. Even though the per-GB cost of this type of node is cheap, if your volume is roughly 200-300 gigs there is no point in using it. To keep our Redshift cluster at a single node, we use Airflow to move everything accumulated in Redshift that the featurization and extraction jobs have already incorporated in their models to S3 in Parquet format. We have configured Spectrum on the same cluster, which lets us run all our ad-hoc queries. Note that we migrate the data from Redshift to S3 in Parquet format because Spectrum bills on the amount of data scanned, and Parquet is a heavily compressed columnar storage format for structured data.
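
For reference, the kind of UNLOAD statement such an Airflow task might run looks roughly like this (the table, S3 prefix and IAM role below are made-up placeholders):

-- Hypothetical archival step: move already-featurized rows to S3 as Parquet
unload ('select * from public.raw_events where created_at < dateadd(day, -90, getdate())')
to 's3://<bucket>/datalake/raw_events/'
iam_role 'arn:aws:iam::123456789012:role/SpectrumRole'
format as parquet;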

  3. Vacuum Cleaning

Whenever a delete or update query is executed on Redshift, the freed space is not reclaimed automatically. This is a design choice by the Redshift developers. Hence, every firm using Redshift is expected to set up a policy of periodically running VACUUM on its clusters. There are multiple types of VACUUM, which you can choose between depending upon the use case and resource availability.
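
As an illustration, the main variants look like this (public.raw_events is a placeholder table name):

vacuum full public.raw_events;        -- reclaims space and re-sorts rows (the default)
vacuum delete only public.raw_events; -- only reclaims space from deleted/updated rows
vacuum sort only public.raw_events;   -- only re-sorts rows, does not reclaim space
vacuum reindex public.raw_events;     -- re-analyzes interleaved sort keys, then runs a full vacuum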

RDS



The cost of the RDS service comprises EC2 compute cost, EBS storage cost, I/O cost, and outbound data transfer cost. In order to optimize the RDS bill, it is important to select a right-sized instance for your application. It is also important to provision the amount of throughput your application actually needs for the instance. You can use the AWS management console to fine-tune the RDS configuration, or you can use fully managed services that take care of optimizing your RDS instance.

One major learning from my recent experience with RDS is to always avoid data dumping. It's important that all the devs on your team understand this and do not load any unwanted tables/databases into RDS. This might seem a trivial problem, but if you are a company that deals with humongous data and relies heavily on machine learning, chances are high that someone will dump data into your running RDS instance for temporary testing/training and later forget to delete it. We at Finbox have 50+ databases in some of our instances, and it's difficult to keep track of which databases are actually in use. Set up a policy and train developers so that only data that needs to be in RDS is stored in it. Review the schema periodically and optimize it for RDS.

Another thing to keep in mind is to carefully design and review SQL queries, which will in turn minimize the load on the machine. Things like caching the auth token instead of validating it against the database can remove one DB call per request made to the server. Small things like these can make a tremendous difference to the load on your machine, which will directly get reflected in your bills.

We realized that even though we had Reserved Instances, our daily RDS bills were significantly high. This came as a surprise, as the expected bill was supposed to be negligible. Upon some research, we realized that the RIs cover only the RDS instance, and storage, snapshot and backup charges are separate. After further digging, we realized that while spinning up a new RDS instance, AWS provides some defaults, which a user should understand before hitting create. Some of these defaults are:

a. Default instance type: AWS suggests m4.xlarge as the default type. Be sure you really need an instance this big. For most of my use cases, especially for our dev environments, a dual-core machine with 4 gigs of RAM (t2.medium, or m4.large if you want to stay in the m family) is more than enough. This really depends on your use case and the kind of traffic you will be routing to this DB.

b. Multi-AZ deployment: This option configures whether your database should be replicated across multiple Availability Zones or not. It is checked by default, but again, know your use case. You might not want to replicate your DB across AZs and increase your server cost multifold for your dev environments.

c. Storage type: AWS provides three types of storage:

  • Provisioned IOPS: Provisioned IOPS storage is designed to meet the needs of I/O-intensive workloads, particularly database workloads, that require low I/O latency and consistent I/O throughput.
  • General Purpose SSD: General Purpose SSD (gp2) volumes offer cost-effective storage that is ideal for a broad range of workloads. These volumes deliver single-digit millisecond latencies and the ability to burst to 3,000 IOPS for extended periods of time. Baseline performance for these volumes is determined by the volume's size.
  • Magnetic: AWS RDS also supports magnetic storage for backward compatibility. We recommend that you use General Purpose SSD or Provisioned IOPS for any new storage needs. The maximum amount of storage allowed for DB instances on magnetic storage is less than that of the other storage types.

Please note that it is not an easy task to reduce the size of your instance. It's a very common situation where a team realizes that their RDS instance is way too over-provisioned and needs to shrink the DB size. Some might try to take a snapshot and restore it into a new DB with less disk space, only to realize that shrinking a snapshot is not a feature AWS provides. We at Finbox also found ourselves in a similar situation and wrote a simple script which took a backup from RDS and restored it into a smaller variant.


FinBox works with banks & NBFCs to digitize their customer journeys & to help them underwrite NTC customers using alternative data from the smartphone. To get in touch, drop a line here.

We are hiring!!! If the stuff I have discussed above excites you, feel free to get in touch at rupesh@finbox.in or tech@finbox.in.


Shweta Singh

Product Marketing Specialist