Tesseract OCR Lambda Layer


AWS Lambda layer containing the tesseract OCR libraries and command-line binary for Lambda Runtimes running on Amazon Linux 1 and 2.

:warning: The Amazon Linux AMI (version 1) is being deprecated. Users are advised not to use Lambda runtimes based on this version (e.g. Python 3.6). Refer also to the AWS Lambda runtime deprecation policy.

Quickstart

This repo comes with ready-to-use binaries compiled against the AWS Lambda runtimes (based on Amazon Linux 1 and 2). Example projects in Python 3.6 (and 3.8) using the Serverless Framework and CDK are provided:

```bash
## Demo using Serverless Framework and prebuilt layer
cd example/serverless
npm ci
npx sls deploy

## or ..

## Demo using CDK and prebuilt layer
cd example/cdk
npm ci
npx cdk deploy
```

Ready-to-use binaries

For compiled, ready-to-use binaries that you can put in your layer, see ready-to-use, or check out the latest release.

See the examples folder for ready-to-use example projects.

Use with Serverless Framework


Reference the path to the ready-to-use layer contents in your serverless.yml:

```yaml
service: tesseract-ocr-layer

provider:
  name: aws

# define layer
layers:
  tesseractAl2:
    # and path to contents
    path: ready-to-use/amazonlinux-2
    compatibleRuntimes:
      - python3.8

functions:
  tesseract-ocr:
    handler: ...
    runtime: python3.8
    # reference layer in function
    layers:
      - { Ref: TesseractAl2LambdaLayer }
    events:
      - http:
          path: ocr
          method: post
```

Deploy

```bash
npx sls deploy
```

Use with AWS CDK


Reference the path to the layer contents in your constructs:

```typescript
const app = new App();
const stack = new Stack(app, 'tesseract-lambda-ci');

const al2Layer = new lambda.LayerVersion(stack, 'al2-layer', {
  // reference the directory containing the ready-to-use layer
  code: Code.fromAsset(path.resolve(__dirname, './ready-to-use/amazonlinux-2')),
  description: 'AL2 Tesseract Layer',
});
new lambda.Function(stack, 'python38', {
  // reference the source code to your function
  code: lambda.Code.fromAsset(path.resolve(__dirname, 'lambda-handlers')),
  runtime: Runtime.PYTHON_3_8,
  // add tesseract layer to function
  layers: [al2Layer],
  memorySize: 512,
  timeout: Duration.seconds(30),
  handler: 'handler.main',
});
```

Build tesseract layer from source using Docker

You can build layer contents manually with the provided Dockerfiles.

Build layer using your preferred Dockerfile:

```bash
## build
docker build -t tesseract-lambda-layer -f [Dockerfile.al1|Dockerfile.al2] .
## run container
export CONTAINER=$(docker run -d tesseract-lambda-layer false)
## copy tesseract files from container to local folder layer
docker cp $CONTAINER:/opt/build-dist layer
## remove Docker container
docker rm $CONTAINER
unset CONTAINER
```
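
After the copy, the `layer` folder can also be zipped and published as a layer version yourself (instead of letting the Serverless Framework or CDK do it). A minimal sketch using boto3; the layer name is a placeholder, and for larger archives you would upload the zip to S3 and pass `S3Bucket`/`S3Key` instead of `ZipFile`:

```python
import io
import pathlib
import zipfile

import boto3

# Zip the "layer" directory produced by the `docker cp` step above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for path in pathlib.Path("layer").rglob("*"):
        zf.write(path, path.relative_to("layer"))

# Publish the zip as a new layer version ("tesseract-al2" is a placeholder name).
client = boto3.client("lambda")
response = client.publish_layer_version(
    LayerName="tesseract-al2",
    Content={"ZipFile": buf.getvalue()},
    CompatibleRuntimes=["python3.8"],
)
print(response["LayerVersionArn"])
```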

Available Dockerfiles

| Dockerfile | Base Image | Compatible Runtimes |
| --- | --- | --- |
| Dockerfile.al1 (:warning: deprecated) | Amazon Linux 1 | Python 2.7/3.6/3.7, Ruby 2.5, Java 8 (OpenJDK), Go 1.x, .NET Core 2.1 |
| Dockerfile.al2 | Amazon Linux 2 | Python 3.8, Ruby 2.7, Java 8/11 (Corretto), .NET Core 3.1 |

Building a different tesseract version and/or language

By default the build generates the tesseract 4.1.3 (amazonlinux-1) or 5.2.0 (amazonlinux-2) OCR libraries with the fast German, English and osd (orientation and script detection) data files included.

The build process can be modified with build-time arguments (defined as ARG in Dockerfile.al[1|2]), using the --build-arg option of docker build.

| Build Argument | Description | Available versions |
| --- | --- | --- |
| TESSERACT_VERSION | the tesseract OCR engine | https://github.com/tesseract-ocr/tesseract/releases |
| LEPTONICA_VERSION | fundamental image processing and analysis library | https://github.com/danbloomberg/leptonica/releases |
| OCR_LANG | language to install (in addition to eng and osd) | https://github.com/tesseract-ocr/tessdata (`<lang>.traineddata`) |
| TESSERACT_DATA_SUFFIX | trained LSTM models for tesseract; can be empty (default), `_best` (best inference) or `_fast` (fast inference) | https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best, https://github.com/tesseract-ocr/tessdata_fast |
| TESSERACT_DATA_VERSION | version of the trained LSTM models for tesseract (currently - in July 2022 - only 4.1.0 is available) | https://github.com/tesseract-ocr/tessdata/releases/tag/4.1.0 |

Example of custom build

```bash
## Build a Docker image based on Amazon Linux 2, with French language support
docker build --build-arg OCR_LANG=fra -t tesseract-lambda-layer-french -f Dockerfile.al2 .
## Build a Docker image based on Amazon Linux 2, with Tesseract 4.0.0 and French language support
docker build --build-arg TESSERACT_VERSION=4.0.0 --build-arg OCR_LANG=fra -t tesseract-lambda-layer -f Dockerfile.al2 .
```

Deployment size optimization

The library files that are part of the layer are stripped before deployment to make them more suitable for the Lambda environment. See the Dockerfiles:

```Dockerfile
RUN ... \
    find ${DIST}/lib -name '*.so*' | xargs strip -s
```

Stripping can cause issues when the build runtime and the Lambda runtime differ (e.g. when building on Amazon Linux 1 and running on Amazon Linux 2).

Building the layer binaries directly using CDK

You can also build the layer directly and retrieve the artifacts (as found in ready-to-use) using AWS CDK with its bundling option.

Refer to continous-integration and the corresponding GitHub Workflow for an example.
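
As a rough sketch of the bundling approach (not the workflow's exact code; the construct IDs and the copy command are assumptions, and the repo's own CDK examples use TypeScript rather than the Python bindings shown here): build `Dockerfile.al2` as the bundling image and copy its `/opt/build-dist` contents into the asset output.

```python
from aws_cdk import App, BundlingOptions, DockerImage, Stack
from aws_cdk import aws_lambda as _lambda

app = App()
stack = Stack(app, "tesseract-layer-bundling")  # hypothetical stack name

# Build Dockerfile.al2 and copy the generated layer contents into the CDK asset output.
# Depending on the base image's ENTRYPOINT you may also need to set `entrypoint`.
al2_layer = _lambda.LayerVersion(
    stack,
    "bundled-al2-layer",
    code=_lambda.Code.from_asset(
        ".",
        bundling=BundlingOptions(
            image=DockerImage.from_build(".", file="Dockerfile.al2"),
            command=["sh", "-c", "cp -R /opt/build-dist/. /asset-output/"],
        ),
    ),
    compatible_runtimes=[_lambda.Runtime.PYTHON_3_8],
)

app.synth()
```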

Layer contents

The layer contents get deployed to /opt when used by a function. See here for details. See ready-to-use for the layer contents for Amazon Linux 1 and Amazon Linux 2 (TODO).
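
For example, a Python function using the layer can shell out to the bundled binary. A minimal sketch, assuming the binary is extracted to /opt/bin/tesseract and the traineddata files live under /opt/tesseract/share/tessdata (check the ready-to-use contents for the actual layout of your build):

```python
import os
import subprocess

# A sketch of a handler that shells out to the tesseract binary shipped in the layer.
# Assumed layout (verify against the ready-to-use contents of your build):
#   /opt/bin/tesseract            -> the binary (layer contents land under /opt)
#   /opt/tesseract/share/tessdata -> the traineddata files
os.environ.setdefault("TESSDATA_PREFIX", "/opt/tesseract/share/tessdata")


def main(event, context):
    # The input image is expected in /tmp, e.g. downloaded from S3 by code not shown here.
    input_path = "/tmp/input.png"
    output_base = "/tmp/result"  # tesseract appends ".txt"

    subprocess.run(
        ["/opt/bin/tesseract", input_path, output_base, "-l", "eng"],
        check=True,
    )

    with open(output_base + ".txt") as fh:
        return {"statusCode": 200, "body": fh.read()}
```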

Known Issues

Avoiding Pillow library issues

Use the Cloud9 IDE on Amazon Linux to deploy the example, or alternatively follow the instructions for getting correct binaries for Lambda using EC2. AWS Lambda runs on an Amazon Linux distribution, which requires matching Python binaries (e.g. for Pillow). This step is not needed for deploying the layer itself: the layer and the example function are deployed separately.

Unable to import module 'handler': cannot import name '_imaging'

You might run into an issue like this:

```
/var/task/PIL/_imaging.cpython-36m-x86_64-linux-gnu.so: ELF load command address/offset not properly aligned
Unable to import module 'handler': cannot import name '_imaging'
```

The root cause is a faulty stripping of libraries using strip here.

Quickfix

You can simply disable stripping (comment out the line in the Dockerfile) and the libraries (*.so) won't be stripped. This also means the library files will be larger and your artifact might exceed Lambda size limits.

A lengthier fix

AWS Lambda runtimes run on top of Amazon Linux. Depending on the runtime, AWS Lambda uses Amazon Linux version 1 or version 2 under the hood. For example, the Python 3.8 runtime uses Amazon Linux 2, whereas Python <= 3.7 uses version 1.

The current Dockerfile runs on top of Amazon Linux version 1, so artifacts for runtimes running version 2 will throw the above error. You can try using a base Docker image for Amazon Linux 2 in these cases:

```Dockerfile
FROM lambci/lambda-base-2:build
...
```

or, as @secretshardul suggested:

> Simple solution: use AWS Cloud9 to deploy the example folder. The layer can be deployed from anywhere.
> Complex solution: deploy an EC2 instance with Amazon Linux and get the correct binaries.

Contributors :heart:

  • @secretshardul
  • @TheLucasMoore for providing a Dockerfile that builds working binaries for Python 3.8 / Amazon Linux 2