If you want to take advantage of the strengths of different public clouds, you often have to move your data. Take machine learning, where Google Cloud Platform seems to have taken the lead: if you want to use TensorFlow as a service, your training datasets have to be copied to GCP. On top of that, managing data at the application level (in my case, an ML application) was giving me a headache.
I used to move data to the cloud with ad-hoc solutions, but that is inefficient and can leave a lot of abandoned data occupying space. With Zenko, you can copy or move data to Google Cloud while keeping track of stray files, controlling your costs and making the whole process less painful.
The limits of uploading data straight into GCP
A common objection to installing Zenko is: why not simply upload the data straight into the cloud?
It depends on what you are doing. Google offers the gsutil CLI tool and the Storage Transfer Service. The first is slow and good for small, one-time transfers, though you have to make sure you don’t end up terminating your command, because gsutil can’t resume the transfer. The Storage Transfer Service runs as a scheduled job on GCP, so you don’t have to babysit it, but if you transfer data from an external source you pay egress and operational GCP fees for using it. It’s also worth mentioning rclone: it is handy for transferring data to GCP but doesn’t manage the transfers at the object level.
Zenko is an open source tool you can use to transfer and manage data between your on-prem location and desired locations in any public cloud. The key difference is that you are able to use one tool to continuously manage/move/backup/change/search the data.
You will need to set up your Zenko instance and register it on Zenko Orbit to proceed with this tutorial. If you haven’t completed that step, follow the Getting Started guide.
Step 1 – Create a bucket in Zenko local filesystem
This bucket (or several buckets) will be the transfer point for your objects. The general bucket-naming rules of the AWS object storage world apply here: follow the same rules when naming buckets on Zenko.
Creating a bucket on Zenko local filesystem
Step 2 – Create buckets in GCP
For each bucket in GCP storage that you want to add to Zenko, create another bucket with the name ending in “-mpu”. For example, if you want a bucket in GCP named “mydata”, you’ll have to create two buckets: one called “mydata” and another called “mydata-mpu”. This is needed because of the way Zenko abstracts away the differences between public cloud providers: the S3 protocol splits big objects into parts, uploads them in parallel to speed up the process, and stitches them back together once all the parts are uploaded. GCP doesn’t have this concept, so Zenko needs the extra bucket to simulate multipart upload (it’s one of the four differences between the S3 and Google storage APIs we discussed before).
Creating “-mpu” bucket on GCP for multipart upload
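If you prefer the command line to the GCP console, the same pair of buckets can be created with gsutil. This is only a sketch: the bucket names and region below are examples, so substitute your own.
# Create the data bucket and its "-mpu" companion in the same GCP region (names and region are placeholders)
gsutil mb -l asia-northeast1 gs://mydata
gsutil mb -l asia-northeast1 gs://mydata-mpu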
Find or create your access and secret keys to the GCP storage service to authorize Zenko to write to it.
Creating/getting access and secret keys from GCP Storage
Step 3 – Add your Google Cloud buckets to Zenko
You need to authorize access to the newly created GCP buckets by adding the keys (follow the instructions in the animation above). In this example I have three buckets on GCP, all in different regions. I will add all three to Zenko and later set the rules that decide which data goes to which GCP region.
Adding GCP buckets to “Storage locations” in Zenko
Now you can set up the rules and policies that will move objects to the cloud. If your objective is moving data to GCP, you have two options: replication or transition policies.
You can replicate data to Google Cloud Storage, with as many rules as you like for different kinds of data. Zenko queues each new object for replication using Kafka, and if replication fails it keeps retrying.
Here is how to set a rule for replication. I am not specifying any prefixes for objects I wish to replicate but you can use this feature to distinguish between objects that should follow different replication rules.
Setting up object replication rules to GCP Storage
Another way to move data with Zenko is through a transition policy. You specify when and where an object will be transferred: the current version of the object in the Zenko local bucket is transferred to a specified cloud location, the GCP data center in Tokyo in my example.
Creating a transition policy from Zenko to GCP Storage
As you can see, there is no manual work involved. You just set up your desired storage locations once and create the rules that all incoming data will follow. It could be data produced by your application every day (Zenko is just an S3 endpoint) or a big dataset you wish to move to GCP without sitting there babysitting the migration.
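Because Zenko exposes a standard S3 endpoint, any S3 client works for the upload itself. As a sketch, here is how the AWS CLI could push data into the Zenko bucket from Step 1; the bucket name, endpoint URL and the “zenko” profile (configured with your Zenko access and secret keys) are placeholders.
# Upload to the Zenko bucket; the replication or transition rules take it from there
# (bucket name, endpoint and profile are assumptions -- use your own values)
aws s3 cp ./training-set.tar.gz s3://zenko-transfer/ --endpoint-url http://zenko.mydomain.local:8000 --profile zenko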
The object counters for target clouds can get out of sync when objects are deleted before they are replicated across regions (CRR), or when deleted or old versions of objects are removed before the delete operations are executed on the target cloud. If this happens, you need to reset the Zenko queue counters in Redis; the instructions are below.
Step-by-step guide
To clear the counters you first need to make sure the replication queues are empty and then reset the counters in Redis.
1) To check the queues, set maintenance.enabled = true and maintenance.debug = true for the deployment. You can do this by enabling the values in the chart and running a “helm upgrade”, or by setting them directly with an upgrade.
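For example, a sketch assuming the release is named “my-zenko” (matching the pod names below) and that the Zenko chart is in the ./zenko directory:
helm upgrade my-zenko ./zenko --set maintenance.enabled=true --set maintenance.debug=true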
This enables some extra pods for performing maintenance activities and debugging. After the deployment is done, make sure the “my-zenko-zenko-debug-kafka-client” pod is running.
2) Then you can enter the pod and check the queues:
% kubectl exec -it [kafka-client pod] bash
# List the available queues (replacing "my-zenko-zenko-queue" with "[your name]-zenko-queue")
root@[pod-name]/opt/kafka# ./bin/kafka-consumer-groups.sh --bootstrap-server my-zenko-zenko-queue:9092 --list
3) Identify the target cloud replication groups relevant to the counters you want to reset and check the queue lag like this:
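Replace the group name below with one of the replication consumer groups returned by --list:
root@[pod-name]/opt/kafka# ./bin/kafka-consumer-groups.sh --bootstrap-server my-zenko-zenko-queue:9092 --describe --group [replication group name]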
Check the “LAG” column for pending actions; the lag should be zero if the queues are empty. If the queues for all of the targets are quiescent, we can move on.
4) Now we can head over to a Redis pod and start resetting counters.
% kubectl exec -it my-zenko-redis-ha-server-0 bash
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS [location constraint]* |grep pending
# (for example: redis-cli KEYS aws-eu-west-1* |grep pending)
# This will return two keys, one for bytespending and one for opspending
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS aws-eu-west-1* |grep pending
aws-eu-west-1:bb:crr:opspending
aws-eu-west-1:bb:crr:bytespending
# Set the counters to 0
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli SET aws-eu-west-1:bb:crr:opspending 0
OK
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli SET aws-eu-west-1:bb:crr:bytespending 0
OK
Do this for each target location that you wish to clear.
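If you have several locations to clear, a small shell loop inside the Redis pod saves some typing; the location names below are just examples.
# Reset the CRR pending counters for each location (location names are placeholders)
for loc in aws-eu-west-1 gcp-asia-northeast1; do
  redis-cli SET ${loc}:bb:crr:opspending 0
  redis-cli SET ${loc}:bb:crr:bytespending 0
done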
Failed Object Counters
Failed object markers for a location will clear out in 24 hours (if they are not manually or automatically retried). You can force them to clear by setting the “failed” counters to zero. You’ll need to find the keys with “failed” in the name and delete them. Something like this:
##
# Grep out the redis keys that house the failed object pointers
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS aws-eu-west-1* |grep failed
##
# Now delete those keys
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli DEL [key name]
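If there are many failed keys, a one-liner can delete them all in a single pass; the location name is again just an example.
# Delete every "failed" key for the location in one go (example location)
redis-cli KEYS aws-eu-west-1* | grep failed | xargs redis-cli DEL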
Developing and debugging a highly distributed system can be hard, and sharing what we learn is a way to help others. For everything else, please use the forum to ask more questions 🙂
Backbeat, a key Zenko microservice, dispatches work to long-running background tasks. Backbeat uses Apache Kafka, the popular open-source distributed streaming platform, for scalability and high availability. This gives Zenko functionalities like:
Asynchronous multi-site replication
Lifecycle policies
Metadata ingestion (supporting Scality RING today, with other backends coming soon)
As with the rest of the Zenko stack, Backbeat is an open-source project, with code organized to let you use extensions to add features. Using extensions, you can create rules to manipulate objects based on metadata logs. For example, an extension can recognize music files by artist and move objects in buckets named after the artist. Or an extension can automatically move objects to separate buckets, based on data type (zip, jpeg, text, etc.) or on the owner of the object.
All Backbeat interactions go through CloudServer, which means they are not restricted to one backend and you can reuse existing solutions for different backends.
The Backbeat service publishes a stream of bucket and object metadata updates to Kafka. Each extension applies its own filters to the metadata stream, picking only metadata that meets its filter criteria. Each extension has its own Kafka consumers that consume and process metadata entries as defined.
To help you develop new extensions, we’ve added a basic extension called “helloWorld.” This extension filters the metadata stream to select only objects whose key name is “helloworld” (case insensitive) and, when processing each matching metadata entry, applies a basic AWS S3 putObjectTagging where the key is “hello” and the value is “world.”
This example extension shows:
How to add your own extension using the existing metadata stream from a Zenko 1.0 deployment
How to add your own filters for your extension
How to add a queue processor to subscribe to and consume from a Kafka topic
There are two kinds of Backbeat extensions: populators and processors. The populator receives all the metadata logs, filters them as needed, and publishes them to Kafka. The processor subscribes to the extension’s Kafka topic, thus receiving these filtered metadata log entries from the populator. The processor then applies any required changes (in our case, adding object tags to all “helloworld” object keys).
Begin by working on the populator side of the extension. Within Backbeat, add all the configs needed to set up a new helloWorld extension, following the examples in this commit. These configurations are placeholders. Zenko will overwrite them with its own values, as you’ll see in later commits.
Every extension must have an index.js file in its extension directory (“helloWorld/” in the present example). This file must contain the extension’s definitions in its name, version, and configValidator fields. The index.js file is the entry point for the main populator process to load the extension.
Add filters for the helloWorld extension by creating a new class that extends the existing architecture defined by the QueuePopulatorExtension class. It is important to add this new filter class to the index.js definition as “queuePopulatorExtension”.
On the processor side of the extension, you need to create service accounts in Zenko to be used as clients to complete specific S3 API calls. In the HelloWorldProcessor class, this._serviceAuth is the credential set we pass from Zenko to Backbeat to help us perform the putObjectTagging S3 operation. For this demo, borrow the existing replication service account credentials.
Create an entry point for the new extension’s processor by adding a new script in the package.json file. This part may be a little tricky, but the loadManagementDatabase method helps sync up Backbeat extensions with the latest changes in the Zenko environment, including config changes and service account information updates.
Instantiate the new extension processor class and finish the setup of the class by calling the start method, defined here.
Update the docker-entrypoint.sh file. These variables point to specific fields in the config.json file. For example, “.extensions.helloWorld.topic” points to the config.json value currently defined as “topic”: “backbeat-hello-world”.
These variable names (e.g. EXTENSION_HELLOWORLD_TOPIC) are set when Zenko is upgraded or deployed as a new Kubernetes pod, which updates these config.json values in Backbeat.
Some config environment variables are less obvious because we did not add them to our extension configs, but they are necessary for running some of Backbeat’s internal processes. Also, because this demo borrows some replication service accounts, those variables (EXTENSIONS_REPLICATION_SOURCE_AUTH_TYPE, EXTENSIONS_REPLICATION_SOURCE_AUTH_ACCOUNT) must be defined as well.
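As an illustration only (the real docker-entrypoint.sh in Backbeat may do this differently, and the config file path here is an assumption), the idea is to map each environment variable set by the Zenko chart onto its config.json field at container start:
# Illustration: rewrite a config.json field from an environment variable with jq
# (file path and handling are assumptions, not the exact Backbeat script)
if [ -n "$EXTENSION_HELLOWORLD_TOPIC" ]; then
  jq ".extensions.helloWorld.topic = \"$EXTENSION_HELLOWORLD_TOPIC\"" conf/config.json > conf/config.json.tmp && mv conf/config.json.tmp conf/config.json
fi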
Finally, run a Helm upgrade, where the Kubernetes deployment name is “zenko”, making sure the “backbeat” Docker image is updated with the new extension changes.
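A sketch of what that upgrade could look like; the image repository, tag and chart value paths below are assumptions that depend on your registry and the Zenko chart version.
# Point the chart at a backbeat image built with the helloWorld extension (all values are placeholders)
helm upgrade zenko ./zenko --set backbeat.image.repository=myregistry/backbeat --set backbeat.image.tag=helloworld-extension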
With the Helm upgrade, you’ve added a new Backbeat extension! Now whenever you create an object with the key name of “helloworld” (case insensitive), Backbeat automatically adds an object tag with the key “hello” and the value “world”.
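To see the result, you can upload an object named “helloworld” through the Zenko S3 endpoint and read its tags back once Backbeat has processed the entry. The bucket, endpoint and profile below are placeholders.
# Upload an object whose key matches the extension's filter (placeholders throughout)
aws s3api put-object --bucket mybucket --key helloworld --body ./some-file --endpoint-url http://zenko.local:8000 --profile zenko
# After the extension has processed the entry, the TagSet should contain hello=world
aws s3api get-object-tagging --bucket mybucket --key helloworld --endpoint-url http://zenko.local:8000 --profile zenko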
Have any questions or comments? Please let us know on our forum. We would love to hear from you.
We want to provide all the tools our customers need for data and storage, but sometimes the best solution is one the customer creates on their own. In this tutorial, available in full on the Zenko forums, our Head of Research Vianney Rancurel demonstrates how to set up a CloudServer instance to perform additional functions from a Python script.
The environment for this instance includes a modified version of CloudServer deployed in Kubernetes (Minikube will also work) with Helm, AWS CLI, Kubeless and Kafka. Kubeless is a serverless framework designed to be deployed on a Kubernetes cluster, which allows users to call functions in other languages through Kafka triggers (full documentation). We’re taking advantage of this feature to call a Python script that produces two thumbnails of any image that is uploaded to CloudServer.
The modified version of CloudServer will generate Kafka events in a specific topic for each S3 operation. When a user uploads a photo, CloudServer pushes a message to the Kafka topic and the Kafka trigger runs the Python script to create two thumbnail images based on the image uploaded.
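To give an idea of the moving parts, here is a hedged sketch of how the function and its Kafka trigger could be registered with Kubeless; the function name, script, handler, runtime and topic are all placeholders, and the actual topic name depends on what the modified CloudServer publishes to.
# Deploy the thumbnail-generating Python function (names and runtime are examples)
kubeless function deploy thumbnail --runtime python3.6 --from-file thumbnail.py --handler thumbnail.handler
# Bind it to the Kafka topic CloudServer publishes S3 events to (topic name is a placeholder)
kubeless trigger kafka create thumbnail-trigger --function-selector created-by=kubeless,function=thumbnail --trigger-topic s3-events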
This setup allows users to create scripts in popular languages like Python, Ruby and Node.js to configure the best solutions to automate their workflows. Check out the video below to see Kubeless and Kafka triggers in action.
As the media and entertainment industry modernizes, companies are leveraging private and public cloud technology to meet the ever-increasing demands of consumers. Scality Zenko can be integrated with existing public cloud tools, such as Microsoft Azure’s Video Indexer, to help “cloudify” media assets.
Azure’s Video Indexer utilizes machine learning and artificial intelligence to automate a number of tasks, including face detection, thumbnail extraction and object identification. When paired with the Zenko Orbit multi-cloud browser, metadata can be automatically created by the Indexer and imported as tags into Zenko Orbit.
Check out the demo of Zenko Orbit and Video Indexer to see them in action. A raw video file—with no information on content beyond a filename—is uploaded with Zenko Orbit, automatically indexed through the Azure tool, and the newly created metadata is fed back into Zenko as tags for the video file. Note that Orbit also supports user-created tags, so more information can be added if Indexer misses something important.
Why is this relevant?
Applications don’t need to support multiple APIs to use the best cloud features. Zenko Orbit uses the S3 APIs and seamlessly translates the calls to Azure Blob Storage API.
The metadata catalog is the same wherever the data is stored. The metadata added by Video Indexer remains available even if the files are expired from Azure and replicated to other locations.
Enjoy the demo:
Don’t hesitate to reach out on the Zenko Forums with questions.
Storing data in multiple clouds without a global metadata search engine is like storing unlabeled wine bottles on random shelves: the wine may be safe, but you’ll never know which bottle is right for dinner. Even a single object storage system can become complex, and once you start uploading files to multiple clouds, things can turn into an inextricable mess where nobody knows what is stored where. The good thing about object stores is that objects are usually stored with metadata that describes them. For example, a video production company can include details indicating that a video file is “production ready”, which department produced the file, when the raw footage was shot, or which rockstar is featured in a video. The tags we used to identify pictures of melons in the Machine Box example are metadata, too.
Zenko offers a way to search metadata on objects stored across any cloud: whether your files are in Azure, Google Cloud, Amazon, Wasabi, Digital Ocean or Scality RING, you’ll be able to find all the videos marked production ready or all the images of watermelons.
The global metadata search capability is one of the core design principles of Zenko: one endpoint to control all your data, regardless of where it’s stored. The first implementation used Apache Spark, but the team realized it wasn’t performing as expected and switched to MongoDB. Metadata searches can be performed from the command line or from the Orbit graphical user interface. Both use a common SQL-like syntax to drive a MongoDB search.
The Metadata Search feature expands on the standard GET Bucket S3 API. It allows users to conduct metadata searches by adding the custom Zenko querystring parameter, search. The search parameter is structured as a pseudo-SQL WHERE clause and supports basic SQL operators. For example, “A=1 AND B=2 OR C=3”. More complex queries can also be made using nesting operators, “(” and “)”.
The search process is as follows:
1. Zenko receives a GET request containing a search parameter:
GET /bucketname?search=key%3Dsearch-item HTTP/1.1
Host: 127.0.0.1:8000
Date: Thu, 18 Oct 2018 17:50:00 GMT
Authorization: <authorization string>
2. CloudServer parses and validates the search string: If the search string is invalid, CloudServer returns an InvalidArgument error. If the search string is valid, CloudServer parses it and generates an abstract syntax tree (AST).
3. CloudServer passes the AST to the MongoDB backend as the query filter for retrieving objects in a bucket that satisfies the requested search conditions.
4. CloudServer parses the filtered results and returns them as the response. Search results are structured the same as GET Bucket results:
You can perform metadata searches by entering a search in the Orbit Search tool or by using the search_bucket tool. The S3 Search tool is an extension of the standard AWS S3 GET Bucket API. S3 Search is MongoDB-native and addresses the search through queries encapsulated in a SQL WHERE predicate, using Perl-Compatible Regular Expression (PCRE) search syntax. In the following examples, Zenko is accessible on endpoint http://127.0.0.1:8000 and contains the bucket zenkobucket.
$ node bin/search_bucket -a accessKey1 -k verySecretKey1 -b zenkobucket -q "\`last-modified\` LIKE \"2018-03-23.*\"" -h 127.0.0.1 -p 8000
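As another hedged example, the same tool can search user-defined metadata; the x-amz-meta-color attribute and its value below are purely illustrative.
$ node bin/search_bucket -a accessKey1 -k verySecretKey1 -b zenkobucket -q "x-amz-meta-color=\"blue\"" -h 127.0.0.1 -p 8000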
Zenko’s global metadata search capabilities play a fundamental role in guaranteeing your freedom to choose the best cloud storage solution while keeping control of your data.