Moving Shopify’s Data Platform to Google Cloud (Cloud Next ’18)

[MUSIC PLAYING]

YANDU OPPACHER: Hi, good afternoon. I'm Yandu. I'm a director on the Data Platform Engineering team at Shopify, and earlier this year we started migrating our data platform from our on-prem data center in Chicago up to Google Cloud.

Shopify is the leading cloud-based commerce platform. We started in 2012, and we have about 600,000 active merchants spread across roughly 175 countries. Together they've sold $63 billion worth of merchandise. That's Shopify by the numbers. But what Shopify really tries to do is give an individual merchant a place to build their brand, find their unique voice, and bring it to market. We support them through our customer support and our ecosystem of apps and developers, and we try to make sure they have every advantage they can get.
So this is why we're here. In the fall of 2016, my senior VP of Data said all data processing needs to be in the cloud by Q4. Fortunately, he didn't say what year, but it still had us thinking and worrying. Implicitly, he also said we couldn't have any SLO violations: our customers still needed to get their data insights, the business still needed answers to its questions, and there could be no data disruptions. We couldn't lose or corrupt data during this migration. We managed to do it, and I'm really excited to tell you about it.
But first, I want to talk a little bit about the data platform as it was in the data center and how it grew up. This is a really brief introduction to our infrastructure. This is how it started: these servers arrived the same week I did, and my onboarding task was to bootstrap these nodes. It took me two days to realize that our HDFS nodes didn't have any hard drives in them. Once we figured out why we couldn't install an OS, everything went smoothly.

From there, once we had the machines up and running, we installed HDFS as our data lake for file storage and YARN for compute and running all of our jobs. The jobs are primarily Spark, and that's where all of our ETL happens: modeling the data. Then, about two years ago, we introduced Presto as our query engine, so that we could make all of the data in HDFS available and queryable via SQL to our analysts and the rest of Shopify.
Our final state: we had 11 petabytes of storage, and by the time we migrated we were using about nine and a half of it. Between Presto and YARN we had around 250 compute nodes. We added 80 terabytes of net-new data per day; the amount we actually produced was a lot higher, around 250 terabytes, but with our lifecycle management the net-new growth was just 80 terabytes. And finally, we ran 25,000 ETL jobs per day on those nine and a half petabytes of data.

So from pretty humble beginnings we grew to a pretty big size, and we were starting to feel the strain of managing this infrastructure with a small team. We went from buying servers, to buying racks, to building out data cages.
So what does the platform do? Basically, lots and lots of ETL. This is where people do their exploration and their modeling, and where we build data insights both for our internal customers and for our merchants. Logically, we break it down into three pieces. Data acquisition is how the data comes in: data pulled from our MySQL shards, from our Kafka message bus, and from third-party services like GitHub or ZenHub. Once the data lands, data processing takes over; this is the majority of the Spark jobs, which take that data, model it, and build out the data artifacts and data assets that power our business and help our merchants. And finally there's data discovery and access, which is how the data assets make it back to our customers: mainly Tableau and Mode for exploration and reporting, Redshift as an analytic database, and Presto for querying and exploration.
That's where we were in January 2018. Then we found out our data center contracts were up in July, so we had six months. We had roughly five infrastructure engineers, 15 to 20 data engineers, and six months. So we really had to focus on time to cloud, and we made a bunch of decisions around that to make sure we could get out of the data center as quickly as possible. We were optimizing for speed while trying not to impact our customers, the data scientists. We really wanted to replace what we had and make the move as much of an implementation detail as possible.
To do that, we really just wanted to lift and shift our infrastructure. We understood HDFS, YARN, and Spark, and that's also what all of our customers were familiar with. We didn't want to burden them with porting all of their models and jobs if we could move everything up to GCP without them having to see it or worry about it. But as much as we didn't want to adopt new services, we also wanted to be pragmatic and use managed services where they made sense. Running things like HDFS ourselves was an operational burden, so wherever we could shift that burden over to Google, we did. We also knew we needed to get out of the DC and into GCP quickly. It wasn't going to be perfect, but it just needed to do its job and get all of our data assets rolling again. We knew there were lots of optimizations we could make once we had the flexibility of being in Google Cloud: preemptible VMs, clusters tailored to specific workloads, and so on. They're great, but none of those optimizations would actually help us get out of the DC faster.
The first piece to move was data discovery. This is like starting with dessert first, but there's a reason we did it. As a reminder, data discovery is the portal into the data warehouse that we gave all of our customers within Shopify: Presto for querying, and Tableau and Mode for reporting.

These are the reasons we chose to do data discovery first. One of the big ones is that it has no dataset dependencies; all the consumers wanted was the latest data, so we didn't have to worry about operational interdependencies to actually generate it. Also, we knew we were going to have to move that three and a half petabytes of data up to GCP anyway, so we had to get good at moving data regardless, and we might as well do it with a slightly lower load. And it was the piece feeling the most pain. Presto usage took off like wildfire: we started with four nodes and had to scale out to 60 within a year, and it runs 25,000 queries per day across the three and a half petabytes of data that we have. Presto really is our query and SQL engine for all of the data in HDFS.
When we look at our data lake, we think of it in two ways. There's the raw data, which is what data acquisition brings in: operational data that we ingest, consolidate, and keep a historical record of. And there are the modeled data assets, which are cleaned and built by our ETL framework.

This is what it looked like in the data center. Basically we had HDFS and Presto. Redshift was our first analytic database, and it only contained front-room data: once front-room data was built by the ETL, we'd package it up, ship it to S3, load it into Redshift, and then people could start reporting off it. Presto had access to all of the data in HDFS, and that's where it queried from, so all it needed was a metadata update. We just say, this is the latest data; you can start querying it.
This is the pipeline we had to build for the migration. The good news is that we already knew how to move quite a bit of data, because we were already moving front-room data from HDFS to S3. We added a new pipeline to move the front-room data from HDFS up to Presto in GCP. The part we didn't quite have yet was for the raw data, and that's where the majority of the 250 terabytes we build every day goes. So we had to work with our network team to make sure we didn't DDoS ourselves pushing all of this data out the front door.

Our network team looked at the amount of data we wanted to move and also at latency. This was an important piece: we had to go and understand what our customers actually needed in terms of latency. For them, new data is generated roughly every four to eight hours, so an extra 15 to 20 minutes of waiting for the freshest data was OK. No relevance would be lost by the added latency.
After understanding those latency concerns, how much data we wanted to move, and wanting a little bit of headroom for growth, we got four 40-gigabit direct connects with Google and made sure that all data coming from our data warehouse was segregated and ran specifically over those links, so we wouldn't impact anything else in the data center.
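As a rough back-of-envelope check on those numbers (my own arithmetic, using only the roughly 250 terabytes per day and four 40-gigabit figures above), the average transfer rate is a small fraction of the link capacity, which is what leaves headroom for bursts and growth:

```python
# Back-of-envelope: ~250 TB produced per day versus 4 x 40 Gb/s of interconnect.
produced_tb_per_day = 250
link_gbps = 4 * 40

avg_gbps = produced_tb_per_day * 8 * 1000 / (24 * 60 * 60)  # ~23 Gb/s on average
print("average demand: %.1f Gb/s of %d Gb/s available" % (avg_gbps, link_gbps))
```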
As for moving the data itself, we had already gotten good at uploading the front-room assets as part of our normal workload. Now we also needed to upload the back-room assets and register them, but that was just another step in a workflow we already understood and knew well.
As part of our migration, we Dockerized Presto. One of the reasons was that in the DC we maintained everything with Chef, our configuration management, and Chef itself wasn't going to be supported once we left the DC, so we needed to move away from it. We also wanted immutable infrastructure. This is the layout we came up with: Presto is Dockerized, which is a pretty simple and easy thing to Dockerize, and we also Dockerized all of our monitoring and logging tooling. All of these run as systemd unit services on Container-Optimized OS. We chose Container-Optimized OS because it's optimized for Docker, and because we didn't want somebody going onto an Ubuntu image, running scripts and installing things, and slowly letting the clusters drift. We already had a very good pipeline for version control and building Docker containers, so we just jumped on that.
Once we had the nodes, we started using Google Deployment Manager to deploy the clusters. That gave us the ability to iterate quickly, because we didn't have to stand everything up by hand and make sure the config files were in place; we could lean on Deployment Manager for that. Presto is also pretty simple: there's a coordinator, and workers that do the work. So it gave us an easy way to start understanding how Deployment Manager works and how to build out more complicated clusters later. And we used managed instance groups so that we could scale the workers on demand.
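To give a flavor of what that looks like, here is a minimal Deployment Manager Python template sketch for the worker side of such a cluster. The resource names, machine type, and properties are made up and simplified, and this is not the actual template; the coordinator would be a similar single compute.v1.instance resource alongside it.

```python
# presto_cluster.py - minimal Deployment Manager Python template sketch.

def generate_config(context):
    """Emit a Presto worker instance template plus a managed instance group."""
    deployment = context.env['deployment']
    zone = context.properties['zone']
    workers = context.properties.get('workerCount', 10)

    worker_template = deployment + '-worker-template'
    resources = [
        {
            # Instance template shared by every Presto worker.
            'name': worker_template,
            'type': 'compute.v1.instanceTemplate',
            'properties': {
                'properties': {
                    'machineType': 'n1-highmem-16',
                    'disks': [{
                        'boot': True,
                        'autoDelete': True,
                        'initializeParams': {
                            # Container-Optimized OS, as described above.
                            'sourceImage':
                                'projects/cos-cloud/global/images/family/cos-stable',
                        },
                    }],
                    'networkInterfaces': [{'network': 'global/networks/default'}],
                    'metadata': {
                        'items': [{'key': 'presto-role', 'value': 'worker'}],
                    },
                },
            },
        },
        {
            # Managed instance group so the worker count can be scaled on demand.
            'name': deployment + '-workers',
            'type': 'compute.v1.instanceGroupManager',
            'properties': {
                'zone': zone,
                'targetSize': workers,
                'baseInstanceName': deployment + '-worker',
                'instanceTemplate': '$(ref.' + worker_template + '.selfLink)',
            },
        },
    ]
    return {'resources': resources}
```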
Once we had the cluster up and the data moving, we started to run in parallel with the DC. We made sure we could stage the data and that it was consolidated, and we reconciled it to make sure that the data we thought was there was actually the data that was there.
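To make "reconciled" a bit more concrete, here is a hedged sketch of the kind of check you can run, not our actual tooling: compare row counts and an order-independent checksum of the same dataset on both sides. The paths and dataset names are made up.

```python
# reconcile.py - hedged sketch of a reconciliation check, not the actual tooling.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconcile").getOrCreate()

def fingerprint(path):
    df = spark.read.parquet(path)
    # Row count plus a CRC32 checksum summed over every row and column.
    row_hash = F.crc32(F.concat_ws("|", *[F.col(c).cast("string") for c in df.columns]))
    checksum = df.select(F.sum(row_hash)).first()[0]
    return df.count(), checksum

dc_copy = fingerprint("hdfs://dc-namenode/data/frontroom/orders")
gcp_copy = fingerprint("gs://shopify-data-warehouse/frontroom/orders")

assert dc_copy == gcp_copy, "DC and GCP copies differ: %s vs %s" % (dc_copy, gcp_copy)
```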
Then we opened it up as a beta to some data scientists, who were super eager to start using it because they could get away from the current cluster, which was overloaded with too much work. After we had reconciled, made sure the data was there, were comfortable with how we were deploying it, and felt it was ready for production, we flipped the switch. Until then, the DC Presto was still running and was the single source of truth; when we were ready, we switched all the reporting over to go against GCP. This was in late Q1.
Once we had Presto up and running, we also wanted to think about security a bit more. In the DC we secured it with a VPN at the edge, but we lost some auditability that way: without a plug-in, Presto allows anybody to put whatever username they want into the user field. We knew the people using it and there was trust within the data org, so that was OK, but we wanted to open it up to more than that. So we created an OAuth proxy, similar to the Identity-Aware Proxy. A user logs in via this proxy in their web browser and follows the OAuth flow they're used to, and we get back a signed token from Google. The user then connects to the client-side proxy, which forwards their request along with the token up to the server. On the server we now have the signed JWT, so we know exactly who it is, and we can set the username before the query goes into Presto. This gave us end-to-end auditability on all of our queries.
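Roughly, the server side of such a proxy does something like the sketch below, here using the google-auth library to verify the token. The client ID, endpoint, and overall structure are placeholders for illustration, not the actual implementation.

```python
# Hedged sketch: verify the Google-signed ID token (a JWT) forwarded by the
# client-side proxy, then set the Presto username from the verified identity.
import requests
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

OAUTH_CLIENT_ID = "example-client-id.apps.googleusercontent.com"  # hypothetical
PRESTO_COORDINATOR = "http://presto-coordinator:8080"              # hypothetical

def verified_user(bearer_token):
    """Validate the signed token against Google's keys and return the identity."""
    claims = id_token.verify_oauth2_token(
        bearer_token, google_requests.Request(), OAUTH_CLIENT_ID
    )
    return claims["email"]

def forward_query(sql, bearer_token):
    user = verified_user(bearer_token)
    # Presto's HTTP protocol reads the effective username from X-Presto-User,
    # so the audited identity is the one Google signed, not whatever the
    # client typed into the user field.
    return requests.post(
        PRESTO_COORDINATOR + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},
    )
```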
So that was moving Presto. We started with the end of the meal; now we're going to set the table. Data acquisition is how all of the data enters our data warehouse. We pull it in from the MySQL shards, from Kafka, and from third-party sources like GitHub. It's really the only place where we have operational snapshots of the data, and it's the base of all of our ETL.

Speedboat is the tool we built to run our data acquisition. It's written in an unholy pairing of JRuby and Scala. It does basic change data capture and produces full snapshots of the operational models. Shopify core runs sharded in our DC, meaning all the shops are scaled horizontally, so the data warehouse is really the only place you can get a consolidated view across all of Shopify.
Speedboat has two main phases. The first is the extract phase, where we pull in the new data; it's basically a select-star where updated_at is greater than the last watermark, so we get back all the rows that have changed since the last time we ran. Then we fold that extract into the existing snapshot to produce the new operational table as of that time.
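Speedboat itself is JRuby and Scala, so the following is only an illustrative PySpark sketch of those two phases, with made-up table, shard, and path names: an incremental extract keyed off updated_at, folded into the previous snapshot by keeping the newest version of each row.

```python
# Hedged PySpark sketch of the two Speedboat phases.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("speedboat-sketch").getOrCreate()

# Phase 1: extract everything that changed since the last run on this shard.
# In reality the watermark is persisted per table and per shard.
last_watermark = "2018-07-01 00:00:00"
changed_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://shard-042.example.internal/shopify")
    .option("dbtable",
            "(SELECT * FROM orders WHERE updated_at > '%s') AS delta" % last_watermark)
    .load()
)

# Phase 2: fold the extract into the existing snapshot, keeping the newest
# version of each row, to produce the new operational table.
previous = spark.read.parquet("gs://warehouse/raw/orders/snapshot=2018-06-30")
newest_first = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
snapshot = (
    previous.unionByName(changed_rows)
    .withColumn("rn", F.row_number().over(newest_first))
    .where("rn = 1")
    .drop("rn")
)
snapshot.write.parquet("gs://warehouse/raw/orders/snapshot=2018-07-01")
```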
For moving this to the cloud, we started leveraging managed services. This part was really good because it allowed another team, not just the infrastructure team, to own their own infrastructure. They decided to use Google Kubernetes Engine for the extractor, colocated with Shopify core so it could access all the databases. They used Dataproc to build the snapshots, all of it built on top of Spark and Scala, with Google Cloud Storage as the storage layer. So this is how it looks: Speedboat runs its extracts from all the different shards and lands them on GCS, and then Dataproc picks them up, folds the new changes into the existing snapshot, and writes it back down to GCS. That's how we get the root for all of our modeling.
Data processing is the piece that took the longest, and the big hairy beast that nobody really wanted to tackle. This is the heart of our data warehouse: it's where all of the raw data gets refined and where we build out the data assets. It's insanely complex. This is about a third of our dependency graph; that's 3,500 individual jobs which run 25,000 times a day. In the DC, as I mentioned before, we had a YARN cluster running on top of HDFS as our data lake, with PySpark as the main driver for all of our ETL.
At Shopify we started using Spark, and PySpark specifically, around the latter part of 2013 and into early 2014, so a lot of legacy baggage came along with that. One piece was that we had to be able to distribute our driver work across the cluster rather than run it on a single node, and this was before YARN cluster mode was fully available. So we Dockerized our driver, made sure everything was in the image, including all the native libraries, and then we could run that Docker driver anywhere in the cluster to spin up a new YARN job without being confined to a single box. And because of PySpark there were a lot of native libraries to deal with: NumPy, SymPy, and SciPy all had to be compiled, and it took about 40 minutes to bootstrap a new node with these dependencies.
So how did we migrate this to the cloud? Again, looking at time to cloud, we had to make a couple of choices up front. One of the major architectural changes was going from HDFS to GCS. This was, again, about context-free versus context-full work: there was no competitive advantage in running HDFS ourselves, and it's cost-prohibitive to run HDFS in GCP, so we decided to move to Google Cloud Storage. There was also the added benefit that Google Cloud Storage's API is compatible with HDFS's file system API, so it was a drop-in replacement.
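The drop-in part looks roughly like this: with the GCS connector available to Spark, hdfs:// paths simply become gs:// paths. The bucket and paths here are hypothetical.

```python
# Same Hadoop FileSystem API, different scheme: gs:// instead of hdfs://.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-drop-in").getOrCreate()

orders = spark.read.parquet("gs://shopify-data-warehouse/raw/orders")  # was hdfs://...
orders.write.mode("overwrite").parquet("gs://shopify-data-warehouse/modeled/orders")
```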
But GCS isn't a file system, and that caused us a lot of grief. It's like driving a car with an automatic transmission and then swapping in a manual: you can still drive, but there's a lot more to worry about. Our ETL framework had a lot of built-in assumptions about how it did its job, mainly around atomic directory moves, and those don't happen anymore. So we had to add sentinel files to ensure we weren't picking up partially processed data.
For deployment, we took the same approach as with Presto. In the DC we had Chef, and the configuration management for Hadoop had grown organically. I tried for two months to refactor it, then rage-quit. We didn't want to bring that ball of YARN and that mess up to GCP with us, so again we looked at Dockerizing everything. This was a more complicated deployment than Presto, but we already understood some of the best practices from going through the Presto migration. We again used GDM for deploying and managing our deployments, and we decided to deploy onto standard Google Compute Engine VMs.
We desperately wanted to use Dataproc, but we could never get the boot time for a Dataproc cluster under 30 minutes, because of all those native libraries we had to compile and build. There was no way to amortize that cost; we wanted to pay the build cost once and then be able to deploy quickly. So we leaned on the fact that we had already Dockerized, and already paid that cost for, the drivers for all of our ETL jobs. All we needed to do was overlay the node manager and resource manager libraries on top. That took spinning up a new node from 40 minutes down to two or three.
This is what we ended up with. It's very similar to the Presto nodes: the node manager runs all the jobs, and all of our logging and monitoring are also Docker containers, built on Container-Optimized OS. The one thing we had to change was how we launch ETL jobs, because of the Docker containers: we had to mount the Docker socket into the node manager so that when it spins up a job, it uses the VM's Docker daemon to spawn it. We also had some issues where, if a node manager failed, the containers it had launched weren't managed by the systemd service, so they would keep running headless and could potentially cause data corruption. So we added hooks into the systemd unit files to make sure that if the node manager ever shut down, we'd go and reap all of the Docker containers.
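The cleanup hook can be as simple as the following sketch, wired in through something like an ExecStopPost= line in the node manager's systemd unit. It assumes, purely for illustration, that containers launched by the node manager carry a "launched-by=nodemanager" label; the real hooks differ in the details.

```python
# reap_containers.py - hedged sketch of the container cleanup hook.
import subprocess

def reap(label="launched-by=nodemanager"):
    container_ids = subprocess.check_output(
        ["docker", "ps", "-q", "--filter", "label=%s" % label]
    ).decode().split()
    for container_id in container_ids:
        # Force-remove so no headless ETL driver keeps writing after the
        # node manager itself has died.
        subprocess.call(["docker", "rm", "-f", container_id])

if __name__ == "__main__":
    reap()
```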
As our deployments became more complex, as we deployed to more projects, and as more services came online, we invested in building something we call Data Deployment Manager. It's really just a wrapper around GDM that enforces our best practices for how we want to package, version, and deploy. It also makes other teams more autonomous: instead of having to make sure they configure a GDM deploy exactly the way we want, we can automate the process and make sure they're running the latest version of the tooling and the latest version of the configs.
For managing those configs, we built something else alongside Data Deployment Manager: Configurator. It's a system to manage and deploy application configuration across our clusters. It works by keeping versioned configuration files in source control and running a sidecar container on each of the VMs. At startup, the sidecar is the first thing to start; it pulls down all the different config files plus runtime information from the metadata API, and using a templating library it stitches them together and puts them into a mounted Docker volume that all the other services on the node can read from. Those services don't need to know what the latest config versions are or how to stitch them together; they just need to know which ones they want, pull them down, and put them in the right place before they start. This let us separate the concerns of building the Docker containers and deploying the configuration. The nice thing is that we can now roll out config changes without having to bounce the entire cluster: Configurator pulls down and builds the latest version of the configs and makes sure the services pick them up.
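A hedged sketch of that sidecar idea is below. The bucket name, template names, and metadata keys are made up, and the real Configurator does much more (versioning, multiple services, refreshes); this only shows the pull-render-write loop.

```python
# Hedged sketch of a Configurator-style sidecar.
import requests
from google.cloud import storage
from jinja2 import Template

METADATA = "http://metadata.google.internal/computeMetadata/v1"
SHARED_VOLUME = "/etc/rendered-config"   # Docker volume mounted by every service

def instance_attribute(key):
    # Runtime information from the GCE metadata server.
    return requests.get(
        "%s/instance/attributes/%s" % (METADATA, key),
        headers={"Metadata-Flavor": "Google"},
    ).text

def render(template_blob, destination):
    bucket = storage.Client().bucket("shopify-config-versions")
    template = Template(bucket.blob(template_blob).download_as_text())
    rendered = template.render(
        cluster=instance_attribute("cluster-name"),
        role=instance_attribute("node-role"),
    )
    with open("%s/%s" % (SHARED_VOLUME, destination), "w") as out:
        out.write(rendered)

# Runs first at node startup; the other containers just read the shared volume.
render("presto/v42/config.properties.j2", "config.properties")
```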
So that's how we built out the nodes, built out the deployment, and made sure we could do the configuration properly. Our migration plan from there was really just: get the infrastructure ready. That last piece, Configurator and Data Deployment Manager, evolved over the entire migration and is still evolving. Because of the tight deadlines, what we wanted to make sure of was that we weren't bottlenecking anybody above us from running tests. So basically we lied, cheated, and stole to get our first cluster up so that people could start running some smoke tests. Then, as we saw where the bottlenecks and issues were, we refined our process and made it repeatable and faster and faster.
Once we had some infrastructure ready, we wanted to fire some tracer bullets to make sure that the end-to-end workflow worked as expected and that there weren't any huge performance, networking, or security issues we just hadn't thought of. We did this before we had to worry about that huge dependency graph I showed before; it was just one end-to-end workflow with no dependencies. Once we were comfortable that we could run one job end to end, we needed to look at how to stage the data and move all of it from the DC up to GCP, so that we could start running essentially a full production workload once we were confident the tooling was there to run it.
Once we had staged the data and started spinning up our production workload, we again ran in parallel with the DC, so that we could reconcile the data. This is also a place where we made some decisions. I talked before about the extract phase in the data acquisition layer: depending on when you do the extract, you can get different data. That could add a little bit of wobble at the root of our tree, which would just get echoed throughout the entire process, and we'd have had no hope of properly reconciling the data if we introduced noise at the root. So we left the extracts running in the DC even while running in parallel, pushed those extracts up to GCP, and ran the snapshots there. That ensured that at the root of all of our ETL, the data matched exactly.
We ran in parallel with the DC and made sure we could reconcile all of the data. Nothing changed; again, we didn't violate our SLOs or SLAs. And then, finally, we flipped the switch again. Up to this point, Presto, Redshift, and all of the data consumed by our customers were still running from data that came from our DC, because we always wanted to maintain a single source of truth; we didn't want some data in the DC and some in GCP and then have to synchronize them. So at one point we flipped the switch, started loading data from GCP instead of the DC, and slowly moved everything over so that the single source of truth was in GCP before we shut down the DC.
That was the bulk of our migration. These are the logical pieces that we moved; there were dozens of services involved in them. Here's a brief timeline. In Q4 2016, David gave the mandate that by Q4 all data processing needed to be in GCP. Then 2017 was pretty much entirely eaten up by contract negotiations and working with Google. That's a piece I hadn't factored into my estimates when I was building out capacity and budgeting for our data center needs for the data warehouse: I had assumed we'd have the contract signed, we'd be able to get going, and we wouldn't need to scale as much. So it was a bit of a scramble, and it became a game of not wanting to invest in hardware we knew we'd only use for a handful of months versus making sure we didn't violate SLOs. It got a little tight near the end. By April 26 we managed to run all production out of GCP without any SLO violations or data disruptions, and as of July 1 we shut down the data center.
So, a few takeaways. One of the things you need to think about is what you're optimizing for: is it cost, or is it time to migration? For us it was time to migration. We knew we were potentially going to be inefficient on cost; our executives knew that and were OK with it, because maintaining two operational systems, and having to renegotiate DC contracts if we slipped, was going to be much more expensive and harder to sustain for any longer than we absolutely needed to.

The other part is using the right tools. For us that meant leaning on managed services as much as possible: if you can shift operational burden away from your team and onto Google, that's a win. But you also need to understand the trade-offs. We wanted to shift the operational burden of running our YARN cluster to Google, but we couldn't because of those boot-up times. We knew Dataproc was working on it, but at the point we needed to make the decision it wasn't ready, and we weren't ready to bet all of our ETL on a feature that wasn't even in alpha yet. That's what picking the appropriate path forward means: even if you know it's suboptimal and might raise the operational burden, you pick the path that lets you optimize for what you're actually after.

Then, for the migration plan, it was about iterating as quickly as possible. We wanted to make sure we were never a bottleneck for our upstream consumers, that we could fire those tracer bullets quickly and correct any issues we saw. We wanted to be fast, and if something broke, to mitigate it quickly, then keep pushing the system and flushing out edge cases as fast as possible.
Finally, it was about identifying the logical pieces we wanted to move. In that dependency graph, it didn't make sense to try to move one piece at a time; the interdependencies were too complicated, and it would have slowed us down. So what we ended up doing was slicing everything vertically, moving each slice over in bulk, and isolating it to a single team that would work on moving their piece once the infrastructure was in place.

So, that's it. Thank you.

[MUSIC PLAYING]
