There are a number of considerations to account for when incorporating serverless architecture into your team's platform, particularly when the slice of the architecture in question comprises your internal data pipelines or workflow automation tools. Data is an important, if not the most important, facet of this decision.
In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful […] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
~ Linus Torvalds ~
As an example, if you have a team that is responsible for processing large PDF documents and extracting information from them programmatically, understanding the strengths and weaknesses of serverless offerings is crucial. Creating a logical representation of your data flow while maintaining a robust architecture without introducing significant DevOps overhead is possible but tricky.
AWS Lambda
Going with an AWS Lambda-first approach, for example, has its pros and cons. Setting up and running a Lambda for internal processing is relatively simple, but there are a number of drawbacks. Writing a Lambda in any of the provided runtimes requires adhering to a specific handler pattern in order for the function to execute (e.g. `def handler_name(event, context): return value` in a Python Lambda handler). Testing or running the function locally in a way that mirrors the Lambda runtime therefore requires additional infrastructure, which is possible but not trivial.
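For quick checks of the function's own logic, the handler can also be called directly as an ordinary Python function. The following is a minimal sketch, not a substitute for a runtime emulator: it does not reproduce the Lambda runtime, permissions, or triggers. The `document_key` event field is a hypothetical name chosen for illustration, and `None` stands in for the unused `context` argument.

```python
import json


def handler(event, context):
    """Entry point in the shape Lambda expects: (event, context) -> value."""
    # `document_key` is a hypothetical field used only for this illustration.
    key = event.get("document_key")
    if key is None:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "document_key is required"}),
        }
    return {"statusCode": 200, "body": json.dumps({"processed": key})}


if __name__ == "__main__":
    # Local smoke test: a plain function call, no Lambda runtime involved.
    print(handler({"document_key": "reports/q1.pdf"}, None))
```

Keeping the business logic free of Lambda-specific objects makes this kind of direct call practical; anything that depends on the real `context` object or on event sources still needs the heavier local tooling.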
Depending on the Lambda's runtime, third-party packages can add to setup overhead. As an example, running a Python runtime Lambda that includes the popular `requests` package requires pre-installing the package locally and zipping it into the deployment package. If you're not developing in a Linux environment, one option would be to set up an EC2 instance, SSH into it, install the package, zip it, and then download the archive to include in the Lambda deployment package.
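The zipping step can be sketched in Python. This sketch assumes the dependencies were already installed into a local directory (for example with `pip install --target build requests`) and that the handler lives in a single file; the names `build`, `lambda_function.py`, and `build_deployment_zip` are hypothetical. The output archive should be written outside the package directory so it does not end up inside itself.

```python
import zipfile
from pathlib import Path


def build_deployment_zip(package_dir, handler_file, output_zip):
    """Bundle installed dependencies and the handler module into one ZIP."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                # Dependencies are stored relative to package_dir so they sit
                # at the archive root, where the runtime can import them.
                archive.write(path, path.relative_to(package_dir).as_posix())
        # The handler module also goes at the archive root.
        archive.write(handler_file, Path(handler_file).name)
    return output_zip
```

A call such as `build_deployment_zip("build", "lambda_function.py", "deploy.zip")` would produce the archive to upload. Packages with compiled extensions still need to be built for a Linux target compatible with the Lambda runtime, which is where the EC2 route comes in.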
Lambda functions also have limited execution resources, with the maximum memory configuration at 3008 MB at the time of writing. There is an additional 512 MB provided in the `/tmp` directory, but this is, as its name implies, temporary and is not guaranteed to be maintained between invocations. Typical workarounds to this limitation include splitting the files on some other service, such as AWS EC2, storing the pieces in S3, and fetching them directly from the Lambda. Creating chains of Lambdas via AWS Step Functions is another way to handle this.
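The S3 workaround might look roughly like the sketch below, in which a handler pulls one pre-split chunk into the scratch directory, works on it, and cleans up. The event fields `bucket` and `key` and the helper name `fetch_chunk` are assumptions for illustration; `download_file` is the standard boto3 S3 client call.

```python
import os


def fetch_chunk(s3_client, bucket, key, scratch_dir="/tmp"):
    """Download one pre-split chunk into scratch space; return its local path."""
    local_path = os.path.join(scratch_dir, os.path.basename(key))
    s3_client.download_file(bucket, key, local_path)
    return local_path


def handler(event, context):
    # Imported here so fetch_chunk can be exercised without AWS libraries.
    import boto3

    s3 = boto3.client("s3")
    # `bucket` and `key` are hypothetical event fields for this sketch.
    path = fetch_chunk(s3, event["bucket"], event["key"])
    try:
        size = os.path.getsize(path)  # placeholder for the real processing
    finally:
        # Scratch space is limited and may be reused by a later invocation,
        # so the file is removed explicitly rather than left behind.
        os.remove(path)
    return {"key": event["key"], "bytes": size}
```

Passing the client into `fetch_chunk` keeps the download logic testable with a stand-in object, without touching a real bucket.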
Another factor is the timeout limit placed on a Lambda execution, with a current maximum of 15 minutes. Some scripts or processes take longer, and one workaround is to split the process into chunks that execute in Step Function Parallel States or in fan-out patterns with subsequently invoked Lambdas. This limitation, coupled with cold start delays when a Lambda is invoked infrequently, can raise reliability concerns.
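The chunking itself is simple to sketch. The example below splits a job of a known size (say, the pages of a large PDF) into ranges small enough to finish within the timeout; the function name and the `first_page`/`last_page` payload fields are hypothetical.

```python
def chunk_ranges(total_items, chunk_size):
    """Split range(total_items) into (start, end) pairs with end exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_items))
        for start in range(0, total_items, chunk_size)
    ]


# Each pair could become the payload for one parallel branch or one
# subsequently invoked Lambda.
payloads = [
    {"first_page": start, "last_page": end}
    for start, end in chunk_ranges(250, 100)
]
```

Choosing the chunk size is the real work: it has to leave enough headroom that the slowest chunk still completes inside the 15-minute limit.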
Increasing the memory configuration and timeout can alleviate these issues somewhat, but the flip side is that doing so may noticeably increase architecture costs. Since usage costs are calculated as a function of memory size, number of invocations, and duration, this can become prohibitive quickly.
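A back-of-the-envelope estimate shows how these three factors interact. The default prices below are illustrative assumptions (roughly the published x86 on-demand rates in one region at one point in time) and ignore the free tier; current values should be taken from the AWS pricing page.

```python
def estimate_monthly_cost(
    invocations,
    avg_duration_s,
    memory_mb,
    price_per_gb_second=0.0000166667,  # assumed rate, check current pricing
    price_per_million_requests=0.20,   # assumed rate, check current pricing
):
    """Rough monthly Lambda cost: compute (GB-seconds) plus request charges."""
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = (invocations / 1_000_000) * price_per_million_requests
    return compute_cost + request_cost


# Same workload at two memory settings: 100,000 one-minute runs per month.
small = estimate_monthly_cost(100_000, 60, 512)
large = estimate_monthly_cost(100_000, 60, 3008)
```

With duration held constant, the compute portion scales linearly with the memory setting, so moving from 512 MB to 3008 MB multiplies it by almost six.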
Shipyard
Shipyard, in many ways, is a heavy-duty version of the serverless platform offered by Lambda.
Whatever script you write and run locally can be run on the platform without any code changes or proprietary configuration. Additionally, the requirements configuration option allows for third-party packages to be included, similar to how a `requirements.txt` file would handle them. Both available memory and execution limits are significantly higher, given that Vessels were designed for heavy data loads and run in dedicated containers for each Voyage.
For processing PDFs, Tesseract could be directly installed on a Vessel, text could be pulled from large, multi-page files, and results could be posted to an external endpoint using `requests` or uploaded to S3 using `boto3`, with either package installed via the requirements configuration page in the Shipyard application.
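As a rough illustration, such a script could be an ordinary Python module with no handler signature at all. The sketch below makes several assumptions that are not confirmed by the platform: it uses the `pytesseract` and `pdf2image` packages (which in turn need the Tesseract and Poppler binaries installed), and the function names, payload shape, and endpoint URL are hypothetical.

```python
def extract_text(pdf_path):
    """OCR every page of a PDF and return a list with the text of each page."""
    # Third-party imports are kept local so the module loads without them.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return [pytesseract.image_to_string(page) for page in pages]


def build_payload(pdf_path, page_texts):
    """Shape the OCR results for posting; this structure is illustrative."""
    return {
        "source": pdf_path,
        "page_count": len(page_texts),
        "pages": page_texts,
    }


def main(pdf_path, endpoint_url):
    """Extract text from the PDF and post it to an external endpoint."""
    import requests

    payload = build_payload(pdf_path, extract_text(pdf_path))
    response = requests.post(endpoint_url, json=payload, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of ignoring them
```

Calling `main("invoice.pdf", "https://example.com/ingest")` runs the same way on a laptop as anywhere else that has the dependencies installed, which is the property the paragraph above describes.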
Shipyard bridges the gap between lighter-weight serverless function services such as AWS Lambda and the non-serverless but heavier-weight options such as AWS EC2 or AWS ECS that require additional DevOps work to get up and running. In a lot of ways, Shipyard can be seen as a robust, heavy-duty platform for internal workflow automation. This design makes it easier for teams to focus on building data pipelines without having to create some of the workarounds mentioned to fit the system.