
Data Curation

  1. Data Curation - (Database content profiling)

A dashboard visualization presents profiles and statistics of the database content: the number of tables, rows, and primary keys; descriptive statistics of the data; correlations between tables; and various other metadata and glossary information.

  2. Data Curation - (Cleansing, Validating, Standardizing)

The data curation process occurs in two stages:

  • During data integration and acquisition. The tool used is BigAction, which already provides standard operators for cleansing, validation, and standardization. A curation mechanism tailored to specific business rules is not yet available, but it can be built and developed.

  • During data processing in the data lake (BigLake). In addition to the standard curation available in BigAction, curation can also be performed with workflow-based rule engines and machine learning.

  3. Data Curation - (Matching, Linking, and Merging)

Similar to the previous point, the data curation process for Matching, Linking, and Merging occurs in two stages:

  • During data integration and acquisition. The tool used is BigAction, which already provides standard operators for cleansing, validation, and standardization, such as the MergeContent processor available in dataflow creation. Other processors provide further functions.

  • A curation mechanism tailored to specific business rules is not yet available, but it can be built and developed.

  • During data processing in the data lake (BigLake). In addition to the standard curation available in BigAction, curation can also be performed with workflow-based rule engines and machine learning.

  4. Data Curation - (Data Quality Rule Engine)

The Data Quality Rule Engine (DQRE) can be built on the Spark (batch) or Spark Streaming (micro-batch/stream) engines, which allows data curation to be carried out both on-read and on-write. The DQRE can be integrated into BigAction dataflows as a processor that can be reused in any dataflow and data pipeline, as sketched below.
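To make the idea concrete, below is a minimal, hedged sketch of a Spark-based batch rule check. The input path, column names, and rules are hypothetical placeholders rather than the actual DQRE configuration; a Spark Streaming variant would apply the same pattern to a streaming DataFrame.

```python
# Minimal sketch of a batch data-quality rule check on Spark.
# Paths, column names, and rules below are illustrative placeholders.
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-rule-engine-sketch").getOrCreate()

df = spark.read.parquet("/data/lake/customers")  # hypothetical BigLake path

# Each rule is a name plus a boolean Column expression that must hold.
rules = {
    "customer_id_not_null": F.col("customer_id").isNotNull(),
    "email_has_at_sign": F.col("email").contains("@"),
}

# A row passes only if every rule holds; failing rows go to quarantine.
passed_all = reduce(lambda a, b: a & b, rules.values())
checked = df.withColumn("dq_passed", passed_all)

clean = checked.filter(F.col("dq_passed"))
quarantine = checked.filter(~F.col("dq_passed"))

clean.write.mode("overwrite").parquet("/data/lake/customers_curated")
quarantine.write.mode("overwrite").parquet("/data/lake/customers_quarantine")
```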

  5. Data Curation - (Interactive Data Quality Visualization/Analytics)

A statistical visualization of all data quality processes will be available on the BigOne dashboard. If a more attractive, dynamic, and customized visualization is needed, it can be developed using BigQuery or BigBuilder.

  6. Data Curation - (Data quality process management (collaborative))

When processing data in the data lake (BigLake), in addition to the standard curation available in BigAction, curation can also be done with workflow-based rule engines and machine learning. The workflow-based Data Quality Rule Engine will be collaborative.

  7. Data Curation (MDM) - (Entity Resolution (MDM))

Not yet available as a feature, but it can be developed as a use case.

  8. Data Curation (MDM) - (Hierarchy management (MDM))

Not yet available as a feature, but it can be developed as a use case.

  9. Data Curation (MDM) - (Golden Record (MDM))

Not yet available as a feature, but it can be developed as a use case.

  10. Data Curation (MDM) - (Machine Learning Support to automate the process (MDM))

Available as a platform, to be developed as a use case.

Data Management

  1. Data Management - (Collecting metadata from various types of data sources, ETL, or other metadata repositories)

BigOne acquires extensive technical and business metadata (e.g. schemas, tables, procedures, attributes) defined when connecting to data sources. Through the REST API, BigAction can be used to exchange metadata with tools such as Apache Atlas and Cloudera Navigator. Metadata can be acquired from data sources ranging from all types of RDBMS and NoSQL databases to document files.

  2. Data Management - (Distributed metadata harvesting server)

By using the BigAction agent, data and metadata collection can be distributed to the location of the data source or edge client. The specific objectives of the BigAction agent are:

  • small and lightweight footprint

  • central management of agents

  • generation of data provenance

  • integration with BigAction for follow-on data flow management

  • In this role, the BigAction agent acts directly at the position/location of a data source, sensor, system, or server.

  3. Data Management - (Semantic type data management)

Semantic management, the data dictionary, the glossary, and related features use the taxonomy and tagging methods in the BigOne framework.

  4. Data Management - (Data management glossary)

Semantic management, the data dictionary, the glossary, and related features use the taxonomy and tagging methods in the BigOne framework.

  5. Data Management - (Auto tagging with semantic type, profiling, and data sampling)

Users can tag fields, profile data and sample data using the taxonomy and tagging methods in the BigOne framework.

  6. Data Management - (Masking sensitive data inside the sample data view)

The masking process can be applied on-read via the BigEnvelope API, so data is masked as it is served; it can also be applied on-write via a masking processor in BigAction. A small illustration follows below.
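The BigEnvelope and BigAction masking configurations themselves are not shown here; purely as an illustration of field-level on-read masking, a small Python sketch with hypothetical field names and masking rules:

```python
import re

# Hypothetical masking rules applied to sensitive fields as records are read.
MASK_RULES = {
    "email": lambda v: re.sub(r"(?<=.).(?=[^@]*@)", "*", v),   # b***@example.com
    "phone": lambda v: "*" * (len(v) - 4) + v[-4:],            # ********7890
}

def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {k: (MASK_RULES[k](v) if k in MASK_RULES and v else v)
            for k, v in record.items()}

print(mask_record({"name": "Budi", "email": "budi@example.com",
                   "phone": "081234567890"}))
```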

  7. Data Management - (Workflow / Role-Based Usability (Data stewards, business analyst, line of business manager))

Data stewardship with BigOne is built around collecting, creating, and using metadata.

By enriching the collected technical metadata with the organization's business vocabulary, users can annotate entities with labels, classifications, attributes, and a glossary. The glossary allows synonyms to be identified, so that differing vocabulary across teams does not get in the way of identifying the same data.

Users can plan which kinds of metadata to apply, design conventions for how and by whom they are applied, and design processes to monitor the metadata being collected.

  8. Data Management - (Stewardship management)

BigOne provides workflow management for data, metadata, and data generation, involving many functions, PICs, and roles. Assigning a user as a Data Steward (admin/data manager), or to other functions, can be done easily.

  9. Data Management - (Lineage tracing)

The lineage tracing process is carried out in two categories:

  • First, during integration with data sources, from data retrieval to storage in the data lake. At this stage, lineage tracing is provided by the BigAction module.

  • There is a graphical representation of the lineage, i.e. the path a FlowFile takes through the dataflow.

  • Second, during data processing to form a data mart. At this stage, lineage tracing is provided by the BigOne module, with a graphical representation of the lineage of data actions.

  10. Data Management - (The process of preparing metadata and cataloging is collaborative)

The metadata preparation and cataloging process is collaborative. For all technical metadata collected from the data sources, the organization's business vocabulary can be built collaboratively based on a workflow.

  11. Data Management - (Data catalog can be consumed via REST API)

The data catalog, technical and business metadata, and the organization's business vocabulary are searched through the BigSearch search engine. Standard searches are exposed as a REST API by BigEnvelope.

  12. Data Management - (Data marketplace (explore and share))

  • Search, access, and share datasets in a data portal (a kind of marketplace).

  • Searches can be made in BigOne and the BigBox Portal. Datasets can be shared through an existing portal or a portal provided by BigBox.

  13. Data Management - (Access control with integration with Active Directory)

  • LDAP with Kerberos

    • LDAP and Kerberos together make a great combination. Kerberos is used to securely manage credentials (authentication), while LDAP is used to store authoritative information about accounts, such as what they are allowed to access (authorization), the user's full name, and uid. Other useful attributes, such as external email addresses, can also be stored in a structured way.

    • Most other LDAP setups involve storing passwords in the LDAP directory itself using the userPassword attribute.

Data Ingestion

  1. Data Ingestion - (Supports ingestion from multiple data sources)

For data access, BigAction supports connectivity via the JDBC API and the use of SQL query syntax. With JDBC and SQL, there is no need to write separate applications or use different query syntax to access different database systems (Oracle, Salesforce, Teradata, PostgreSQL, and Tibero, for example). In addition, BigAction has connectivity libraries for NoSQL and distributed stores such as HDFS, MongoDB, Cassandra, HBase, Redis, and others, as well as the capability to read and extract unstructured data from document files (csv, xls, doc, ppt, etc.), PDF files, images, and even video and audio. A hedged JDBC example follows below.
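In BigAction itself, JDBC access is configured through operators rather than code. Purely as an analogy of what the same JDBC/SQL pattern looks like in code, a PySpark sketch in which the URL, table, and credentials are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-ingest-sketch").getOrCreate()

# The same read pattern works for Oracle, Teradata, PostgreSQL, etc.;
# only the JDBC URL and driver class change.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")   # placeholder
          .option("dbtable", "public.orders")                      # placeholder
          .option("user", "etl_user")
          .option("password", "***")
          .option("driver", "org.postgresql.Driver")
          .load())

orders.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS n FROM orders GROUP BY status").show()
```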

  2. Data Ingestion - (Able to integrate & retrieve data from various data sources (e.g. HTTP, HTTPS, API, and SFTP communication protocols))

BigAction has libraries that support various communication protocols such as HTTP, HTTPS, API and SFTP. Here are some examples:

  • GetHTTP:

This is the operator for fetching data from HTTP or HTTPS URLs and writing data to the content of the FlowFile. Once the content is fetched from the given URL, it will not be fetched again until the content on the remote server (data source) changes.

  • InvokeHTTP:

An HTTP client processor that can interact with configured HTTP endpoints. The destination URL and HTTP method are configurable. FlowFile attributes are converted to HTTP headers, and the content of the FlowFile is included as the request body (if the HTTP method is PUT, POST, or PATCH). A conceptual sketch follows below.
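In BigAction/NiFi this is configured in the processor UI rather than written as code. Purely as a conceptual illustration of what an InvokeHTTP POST does, a Python sketch with a hypothetical endpoint and payload:

```python
import requests

# FlowFile attributes become HTTP headers; FlowFile content becomes the body.
attributes = {"X-Source-System": "crm", "Content-Type": "application/json"}
content = b'{"customer_id": 42, "status": "active"}'

response = requests.post("https://api.example.com/ingest",   # hypothetical endpoint
                         headers=attributes,
                         data=content,
                         timeout=30)
response.raise_for_status()
print(response.status_code, response.headers.get("Content-Type"))
```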

  3. Data Ingestion - (Bulk/Batch Data Movement)

BigAction supports bulk/batch data ingestion and data movement with processors that are already available.

  4. Data Ingestion - (Stream Data Integration, including from Change Data Capture to Big Data with upsert support)

Examples of change data capture operators include CaptureChangeMySQL and CDC with Apache NiFi:

  • CaptureChangeMySQL:

Retrieves Change Data Capture (CDC) events from a MySQL database. CDC events include INSERT, UPDATE, and DELETE operations. Events are output as individual FlowFiles ordered by the time the operation occurred.

  • CDC with Apache NiFi:

The process of using the CaptureChangeMySQL, EnforceOrder, and PutDatabaseRecord processors in Apache NiFi to replicate databases from MySQL event logs. A small upsert sketch follows below.
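The NiFi flow itself is built visually; to make the upsert idea concrete, here is a small, library-free Python sketch of applying time-ordered CDC events. The event structure is hypothetical, loosely modeled on what a CDC processor emits:

```python
# Apply INSERT / UPDATE / DELETE change events with upsert semantics.
table = {}  # primary key -> current row

def apply_cdc_event(event: dict) -> None:
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        table[key] = row            # upsert: insert or overwrite
    elif op == "delete":
        table.pop(key, None)

events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "Ani"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "Ani Wijaya"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "Budi"}},
    {"op": "delete", "key": 2},
]
for event in events:                # events are already ordered by time
    apply_cdc_event(event)

print(table)                        # {1: {'id': 1, 'name': 'Ani Wijaya'}}
```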

  5. Data Ingestion - (Process data both batch and realtime/streaming (publish-subscribe messaging))

BigAction can perform integration functions and process data in both batch and realtime/streaming modes with Kafka, with BigAction acting as the flow manager. A streaming sketch follows below.
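As a hedged sketch of the streaming side (topic name, brokers, and output path are placeholders; the spark-sql-kafka connector package is assumed to be available), a Spark Structured Streaming job that consumes a Kafka topic and lands it in the data lake:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
          .option("subscribe", "transactions")                 # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(F.col("key").cast("string"),
                  F.col("value").cast("string"),
                  "timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/lake/transactions")            # placeholder path
         .option("checkpointLocation", "/checkpoints/transactions")
         .start())
query.awaitTermination()
```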

Data Storage

  1. Data Storage - (HDFS or Data Lake storage)

Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has a lot in common with existing distributed file systems. However, the difference from other distributed file systems is quite significant. HDFS is highly fault tolerant and designed for use on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

  2. Data Storage - (Access control with integration with Active Directory)

Kerberos is a computer network authentication protocol, which provides secure Single Sign On (SSO) based on a trusted third party mutual authentication service.

  • This is secure because the user's password is never transmitted over the network. Kerberos uses Tickets negotiated with the server, with a limited life time.

  • This is an SSO system, one ticket can be used by all services until the validity period expires.

  • It acts as a trusted third party because all user and service keys are managed by the Kerberos server.

  • It is a mutual authentication system: it guarantees not only that the user is who they claim to be, but also that the service being accessed is the expected service.

  • Kerberos is widely used in the Microsoft world, because it is the protocol behind the default Windows authentication mechanisms.

  3. Data Storage - (Data Security (Encryption))

HDFS implements transparent end-to-end encryption. Once configured, data read from and written to a dedicated HDFS directory is transparently encrypted and decrypted without requiring changes to the user's application code. This encryption is also end-to-end, meaning data can only be encrypted and decrypted by the client. HDFS never stores or has access to unencrypted data or unencrypted data encryption keys. It satisfies two general requirements for encryption: at-rest encryption (meaning data on a persistent media, such as a disk) as well as in-transit encryption (e.g. when data travels over a network).

  4. Data Storage - (High Performance SQL Database (MemSQL))

How the SingleStore DB works:

  • SingleStore DB is a distributed relational database that handles real-time transactions and analytics at scale. It is accessible via the standard SQL driver and supports ANSI SQL syntax including joins, filters, and analytical capabilities (e.g. aggregate functions, group by, and windowing).

  • SingleStore DB scales horizontally on cloud instances or industry-standard hardware, providing high throughput across multiple platforms.

  • The SingleStore DB database maintains broad compatibility with common technologies in the modern data processing ecosystem (e.g. orchestration platforms, developer IDEs, and BI tools), so you can easily integrate it into your existing environment. It features an in-memory rowstore and an on-disk columnstore to handle highly concurrent operational and analytical workloads.

  • SingleStore DB also features a data ingestion technology called SingleStore Pipelines that streams large amounts of high-throughput data into the database with exactly-once semantics. A hedged connection sketch follows below.
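Because SingleStore speaks the MySQL wire protocol, a standard MySQL driver can be used from application code. A minimal sketch in which the host, credentials, and schema are placeholders:

```python
import pymysql  # SingleStore is MySQL wire-protocol compatible

conn = pymysql.connect(host="singlestore-host", port=3306,     # placeholders
                       user="app_user", password="***", database="analytics")
try:
    with conn.cursor() as cur:
        # Standard ANSI SQL with joins, aggregates, and GROUP BY works as-is.
        cur.execute("""
            SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
            FROM orders
            GROUP BY region
            ORDER BY revenue DESC
        """)
        for region, orders, revenue in cur.fetchall():
            print(region, orders, revenue)
finally:
    conn.close()
```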

  5. Data Storage - (High performance Graph Database)

BigBox can be integrated with high-performance graph databases such as TigerGraph or Redis. For integration with Redis, BigAction already has a RedisConnectionPoolService; for integration with TigerGraph, the JDBC library can be used.

  6. Data Storage - (Support spatial data type)

SingleStore supports the use of geospatial data types:

  • SingleStore has added support for geospatial queries. The term "geospatial" covers a very wide range of features. There is a formal standardization for some of these features, known as SQL-MM or OpenGIS. The features in SingleStore are not complete implementations of the standard, but follow the same structure. At a basic level, geospatial support requires three things: a data type, a topology function, and a measurement function.

  • The three main types of geospatial objects are points, paths, and polygons. SingleStore supports a standard subset of Well-Known Text (WKT) syntax to describe them. For example, POINT is a simple longitude/latitude pair. A hedged example follows below.
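As a hedged illustration only, two SQL statements using WKT points; the table and column names are hypothetical, and the geospatial type and function names should be checked against the SingleStore version in use:

```python
# Illustrative SQL strings; run through any MySQL-compatible client as above.
CREATE_STORES = """
CREATE TABLE stores (
    id BIGINT PRIMARY KEY,
    location GEOGRAPHYPOINT        -- WKT point, e.g. 'POINT(106.8456 -6.2088)'
);
"""

NEAREST_STORES = """
SELECT id,
       GEOGRAPHY_DISTANCE(location, 'POINT(106.8456 -6.2088)') AS meters
FROM stores
ORDER BY meters
LIMIT 5;
"""
```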

  7. Data Storage - (Full Text Search)

  • Apache Solr:

    • Advanced Full Text Search Capabilities.

      • Powered by Lucene™, Solr enables powerful match capabilities including phrases, wildcards, joins, grouping and more across all data types.
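As a hedged illustration of querying Solr from application code using the pysolr client (the core name, fields, and URL are placeholders; in BigBox the search layer would normally be reached through BigSearch rather than directly):

```python
import pysolr

# Placeholder Solr core URL.
solr = pysolr.Solr("http://solr-host:8983/solr/documents", timeout=10)

# Phrase and wildcard matching, as supported by Lucene query syntax.
results = solr.search('title:"data quality" OR body:curat*', rows=10)
for doc in results:
    print(doc.get("id"), doc.get("title"))
```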

  8. Data Storage - (Able to Handle Multiple User Sessions)

SingleStore is distributed SQL with an aggregator-and-leaf architecture, making it possible to handle many concurrent user sessions.

  9. Data Storage - (Able to Handle Data and Still Able to Provide Fast Query Response (<1 second))

  • SingleStore provides the interactive response time users want, on large data sets, with concurrent access, without using pre-computed results.

  • SingleStore has demonstrated querying 1.2 billion rows of data in approximately 0.045 seconds.

  10. Data Storage - (Has in-database Machine Learning functions)

  • SingleStore offers an ultra-fast query and ingest platform that enables real-time model assessment on streaming and historical data.

  • SingleStore offers the ideal architecture for ML and AI applications with an all-in-one database solution that can rapidly model and assess rapidly changing data for real-time response to dynamic conditions driven by customers, machines or 24/7 operations.

  11. Data Storage - (Has the ability to store various data formats, be it structured, semi-structured or unstructured)

Databases can store structured, semi-structured and unstructured data

  • HDFS usually stores large data (gigabytes to petabytes) and also supports data storage such as structured, semi-structured, and unstructured data.

  12. Data Storage - (Able to perform the insert-update-select process on one table simultaneously (ACID compliant))

Data Storage → (ACID Compliance)

  • Apache Hive

Apache Hive can perform ACID (atomicity, consistency, isolation, and durability) v2 transactions at row-level without any configuration. By default, managed tables are ACID tables. You cannot disable transactions.

  • SingleStore

Transactional support in SingleStore is similar to that of any NoSQL system, as both SingleStore and NoSQL systems can perform continuous loading. The difference is that SingleStore is ACID compliant. SingleStore brings every transaction to disk, so there is no risk of data loss. SingleStore has a classic logging mechanism, so we can perform HA configuration and transaction replication to other nodes or clusters to ensure data availability.

  13. Data Storage - (BigLake Manager)

  • BigLake Manager functions as a GUI-based dashboard for the management and monitoring of Big Data infrastructure that can be accessed via a browser, which at least includes user management, access management to data objects, node, cluster or multicluster management, resource and load management (Workload Management).

  • BigLake Manager functions as a dashboard for the management and monitoring of Big Data infrastructure that can be configured and integrated with an email server to get notifications in the form of emails that, at a minimum, include job activities, resources and availability, system errors/warnings.

  14. Data Storage - (BigLake HDFS)

  • BigLake has an HDFS component to which nodes can easily be added (scaling out) and patches applied without service downtime, or with minimal impact on the ecosystem.

  • BigLake uses HDFS as its storage component, where every piece of data stored in HDFS is replicated with a replication factor of 3, supporting fault tolerance if a server is down or a node is not functioning.

  • Accessible (read and write) by ETL tools such as InnoQuartz and by data integration solutions such as Talend, among others.

  15. Data Storage - (Data Security Module)

BigLake supports various security features, such as authentication, access authorization, audit trail, data masking, and data encryption, using components such as Ranger, Knox, and Kerberos.

  16. Data Storage - (Manage the allocation of computing resources)

BigLake has the ability to divide resource allocation between users or groups based on CPU, storage, and memory needs, using components such as YARN.

Data Provisioning

  1. Data Provisioning - (High Performance OLAP Database)

  • Apache Druid

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. Druid is most often used as a database for use cases where real-time ingest, fast query performance, and high uptime are important. As such, Druid is typically used to power the GUIs of analytic applications, or as a backend for highly concurrent APIs that require fast aggregation. Druid works best with event-oriented data. A query sketch follows the list below.

  • Common application areas for Druid include:

    • Clickstream analytics (web and mobile analytics)

    • Network telemetry analytics (network performance monitoring)

    • Server metrics storage

    • Supply chain analytics (manufacturing metrics)

    • Application performance metrics

    • Digital marketing/advertising analytics

    • Business intelligence / OLAP
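As a hedged sketch of how an application can query Druid over its SQL endpoint (the broker/router URL and datasource name are placeholders):

```python
import requests

# Druid exposes a SQL API at /druid/v2/sql on the broker or router.
DRUID_SQL_URL = "http://druid-broker:8888/druid/v2/sql"   # placeholder

query = {
    "query": """
        SELECT TIME_FLOOR(__time, 'PT1H') AS hour, COUNT(*) AS events
        FROM clickstream
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
        GROUP BY 1
        ORDER BY 1
    """
}

resp = requests.post(DRUID_SQL_URL, json=query, timeout=30)
resp.raise_for_status()
for row in resp.json():
    print(row["hour"], row["events"])
```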

  2. Data Provisioning - (Can form data marts)

  • BigLake can form data marts for feeding data to dashboards/visualizations.

Data Serving

  1. Data Serving - (Facilitates queries with SQL or No SQL)

BigQuery facilitates SQL and NoSQL queries. If a user needs to analyze a question that cannot be expressed using the question builder, SQL can be used instead. Users can write SQL queries directly in the provided query editor.

Data Consumption

  1. Data Consumption (Business Intelligence) - (Able to access various data sources)

BigQuery can work with many types of databases, such as:

  • BigQuery (Google Cloud Platform)

  • Druid

  • Google Analytics

  • H2

  • MongoDB

  • MySQL

  • Oracle

  • PostgreSQL

  • Presto

  • Redshift (Amazon Web Services)

  • Snowflake

  • SparkSQL

  • SQL Server

  • SQLite

  • Vertica

  2. Data Consumption (Business Intelligence) - (Report creation and consumption)

The BigQuery dashboard consists of several portlets arranged on the available grid. These portlets can be a combination of tables, graphs, maps, or text that supports the reporting or dashboard creation process. Users can also add a filter widget to filter data across multiple queries, so that chart or table data adjusts automatically.

  • Create dashboards and reporting instantly
    At the top right of the screen, click the + icon to open the menu and select Create new dashboard. Give the new dashboard a name and description, choose which collection it belongs to, then click Create, and BigQuery will open the new dashboard page. If the user doesn't want to create a dashboard from scratch, or wants to experiment with changes without affecting the original dashboard, the user can duplicate an existing dashboard.

  • Adding Questions to Dashboard

There are two ways to add a question to the BigQuery dashboard: from the dashboard directly, or from the question you want to add.

  1. From a dashboard : Click the pencil icon to edit the dashboard. Then click the + icon at the top right of the editing dashboard (not the + icon in the main navigation) to add any saved questions to the dashboard

  2. From a question : Users can add a new saved question directly to the dashboard from the pop-up that appears after saving a question for the first time. Users can also add questions to the dashboard by clicking on the pencil icon next to the question name, and selecting Add to dashboard.

After the user adds a question, it appears as a portlet on the dashboard.


  • Add Description Text
    Users can easily add text such as descriptions, explanations, notes, or even images and GIFs to the dashboard. To add new text, select the Aa button at the top right.

  • Assembling Portlets in the Dashboard
    Users can move each question or portlet on the dashboard by drag & drop, or resize it as desired within the available grid. Select the pencil icon at the top right of the dashboard to enter the dashboard editing interface.

  1. To move the portlet, simply click and drag the portlet to the desired position

  2. To resize the portlet, click in the lower right corner of the portlet, and drag to resize as desired

  3. To delete a portlet, you can hover over the portlet, and click the X icon in the upper right corner.

  • Automatic Updates: if the underlying data is updated frequently, the user can set the dashboard to refresh automatically by clicking the clock icon.

Users can change dashboard settings to update at 1, 5, 10, 15, 30, and 60 minute intervals, depending on how often data updates are required. The system will rerun all queries on the dashboard according to the selected time interval.

  3. Data Consumption (Business Intelligence) - (Dashboard)

The BigQuery dashboard consists of several portlets arranged on the available grid. These portlets can be a combination of tables, graphs, maps, or text that supports the reporting or dashboard creation process. Users can also add a filter widget to filter data across multiple queries, so that chart or table data adjusts automatically.

  • Create dashboards and reporting instantly
    At the top right of the screen, click the + icon to open the menu and select Create new dashboard. Give the new dashboard a name and description, choose which collection it belongs to, then click Create, and BigQuery will open the new dashboard page. If the user doesn't want to create a dashboard from scratch, or wants to experiment with changes without affecting the original dashboard, the user can duplicate an existing dashboard.

  4. Data Consumption (Business Intelligence) - (Collaboration and Sharing Features)

BigQuery supports multi-tenancy schemes that allow collaboration between users, as well as sharing of the resulting data insights. In managing data, a BigQuery admin can grant each user different levels of data access. Here are some of the available settings:

  1. Application embedding: users can embed charts or dashboards into other systems using several methods, including:

  • Public embedding: the user shares a public link that points to a page; the link can be used by other applications or embedded via an iframe.

  • Private embedding: just like public embedding, except that the iframe is secured with a JSON Web Token (JWT), so restrictions can be applied that match each user's access rights (a signing sketch follows below).

Figure: Sharing options include Public link, Public embed, and Embed this dashboard in an Application (a signed embed).
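As a hedged sketch of generating a signed embed URL with PyJWT, assuming the embedding endpoint follows the Metabase convention referenced above; the secret, site URL, dashboard id, and locked parameter are placeholders:

```python
import time
import jwt  # PyJWT

EMBED_SECRET = "replace-with-embedding-secret"   # placeholder secret key
SITE_URL = "https://bi.example.com"              # placeholder BI site URL

payload = {
    "resource": {"dashboard": 7},                # placeholder dashboard id
    "params": {"region": "jakarta"},             # locked filter for this viewer
    "exp": int(time.time()) + 600,               # token valid for 10 minutes
}

token = jwt.encode(payload, EMBED_SECRET, algorithm="HS256")
iframe_url = f"{SITE_URL}/embed/dashboard/{token}#bordered=true&titled=true"
print(iframe_url)
```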

  2. Single Sign-On (SSO)
    With SSO, users can avoid an extra login screen appearing where their charts should be, and user permissions can be coordinated across BigQuery and the embedding application.


Figure: Add attributes to people in Metabase to coordinate user permissions between your app and your embedded Metabase instance.

  3. Data Sandbox
    Users can parameterize attributes to filter what each user can see and access, down to specific table columns and rows.


  5. Data Consumption (Business Intelligence) - (Alerting)

BigQuery allows users to receive notifications/alerts based on parameters such as revenue, users, or negative reviews. Alerts can be configured to send notifications via either email or Slack. There are three types of alerts in BigQuery:

  1. When the time series crosses the target or threshold

  2. When the progress bar reaches or falls below the target

  3. When a saved question returns results.

In addition, BigQuery administrators can view and edit individual alerts on all saved queries. Users will also see additional options for adding recipients.



  6. Data Consumption (Business Intelligence) - (Visualization)

Users can configure the interaction or action that happens when someone clicks on a question or portlet on the dashboard. By default, when a user creates a graph using the query builder, the graph is automatically equipped with drill-through capabilities for further exploration. Some examples of interactive actions:

  • Drill-down, allowing users to dig deeper into the data behind a portlet or query.

  • Sending users to specific destinations, such as a dashboard, another question/portlet, or a custom URL, when they click on a question/portlet.

  • Using dashboard filters to filter as needed, so that the graphs and tables on the dashboard automatically follow the filter parameters.

  7. Data Consumption (Business Intelligence) - (Supports interactive analysis)

  8. Data Consumption (Business Intelligence) - (Export data to Excel)

With BigQuery, users can easily download the result of a portlet or question by clicking the Download arrow at the bottom right of the screen. Results can be downloaded as a .csv, .xlsx, or .json file. The maximum download size is 1 million rows.

  9. Data Consumption (Business Intelligence) - (Facilitates search)

  10. Data Consumption (Business Intelligence) - (Data storytelling)

BigQuery's instant analysis feature is X-ray. X-rays are a quick and easy way to automatically gain insight into and explore data. If users are unsure where to start exploring, the X-ray feature scans the data set to produce an overview and a short story from the data.

  11. Data Consumption (Business Intelligence) - (Report generation with map view and geospatial data)

BigQuery can generate reports with map views and geospatial data. When a user selects a map visualization setting for the first time, BigQuery will automatically select the best map type based on the available data table. By default, BigQuery uses OpenStreetMaps for map visualization, but there are several customization options that can be made, such as GeoJSON format or others. There are three types of maps that can be created:

  • Pin Map to mark specific locations based on latitude and longitude

Figure: Example of a pin map made from the people table; hovering over one of the pins reveals additional information about the person it represents.

  • Region Map to group data by country or state. This map of the region requires that the user data have a column with the field type "Region" or "Country".

  • Grid Map is a visualization that distributes points in a certain area by creating colorful gradients that cover the map. Grid maps are a great way to visualize the areas experiencing or having the most activity.

  12. Data Consumption (Business Intelligence) - (Has function to form OLAP cubes)


  13. Data Consumption (Big Builder) - (Interaction with multiple data sources)

BigBuilder has a data builder that allows interaction with (querying of) various data sources, including Spark and Impala.

Data Operation

  1. Data Operation - (Workload Management)

  • Apache YARN: the basic idea of YARN is to split the resource management and job scheduling/monitoring functionality into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

  2. Data Operation - (Monitoring and Alerting)

  • Apache Ambari aims to make Hadoop management simpler by developing software to provision, manage, and monitor Apache Hadoop clusters. Ambari provides an intuitive and easy-to-use Hadoop management web UI backed by its RESTful API.

  • Ambari allows system administrators to:

    • Provision a Hadoop cluster

      • Ambari provides a step-by-step guide to installing Hadoop services on any number of hosts.

      • Ambari handles the Hadoop service configuration for the cluster.

    • Manage Hadoop clusters

      • Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the cluster.

    • Monitor Hadoop clusters

      • Ambari provides a dashboard to monitor Hadoop cluster health and status.

      • Ambari utilizes the Ambari Metrics System for metric collection.

      • Ambari leverages the Ambari Alert Framework for system alerts and notifies you when your attention is needed (e.g., dead nodes, remaining disk space running low, etc.).

  • Ambari enables application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST API, as sketched below.
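As a hedged example of using the Ambari REST API for monitoring (the host, cluster name, and credentials are placeholders):

```python
import requests

AMBARI = "http://ambari-host:8080/api/v1"     # placeholder Ambari server
AUTH = ("admin", "***")                       # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}

# List the clusters managed by this Ambari instance.
clusters = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS).json()
print([c["Clusters"]["cluster_name"] for c in clusters["items"]])

# Check the state of a service (e.g. HDFS) on a given cluster.
svc = requests.get(f"{AMBARI}/clusters/mycluster/services/HDFS",
                   auth=AUTH, headers=HEADERS).json()
print(svc["ServiceInfo"]["state"])            # e.g. "STARTED"
```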


  3. Data Operation (Environment Management) - (Management of resource allocation to users/groups)

  • Apache YARN

  1. YARN splits the resource management and job scheduling/monitoring functionality into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

  2. Apache YARN Resource Management (CGroups)

  • CGroups isolate CPU-heavy processes. When CPU scheduling is used, CGroups also need to be used to limit and manage CPU usage.

  4. Data Operation (Environment Management) - (Resource management for data science activities)

Apache YARN:

  • YARN splits the resource management and job scheduling/monitoring functionality into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

  • Apache YARN Resource Management (CGroups)

  • CGroups isolate CPU-heavy processes. When CPU scheduling is used, CGroups also need to be used to limit and manage CPU usage.

Advanced Analytics And Data Science

  1. Advanced Analytics and Data Science - (Has an analytics platform working environment for complete machine learning-based data modeling and supports collaborative development)

Kaggle offers a customizable, setup-free Jupyter Notebooks environment, with free GPU access and a repository of community-published data and code.

  2. Advanced Analytics and Data Science - (Has Evaluation Model function (Confusion Matrix, Cross Validation, AUC))

Apache Spark:

  • spark.mllib comes with a number of machine learning algorithms that can be used to learn from and make predictions on data. When these algorithms are applied to build machine learning models, the performance of the models needs to be evaluated against criteria that depend on the application and its requirements. spark.mllib provides a suite of metrics for evaluating the performance of machine learning models. The evaluation models include:

    • Classification model evaluation

      • Binary classification (threshold tuning)

      • Multiclass classification (label-based metrics)

      • Multilabel classification

      • Ranking systems

    • Regression model evaluation

    • Support for evaluating models using a confusion matrix, cross-validation, and AUC (see the sketch below)
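A minimal, hedged sketch of evaluating a classifier in Spark with AUC, cross-validation, and a confusion matrix; the training data and model below are toy placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("model-evaluation-sketch").getOrCreate()

# Toy, balanced training data: (features, label).
rows = [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([0.1, 0.9]), 0.0),
        (Vectors.dense([0.2, 0.8]), 0.0), (Vectors.dense([0.3, 0.7]), 0.0),
        (Vectors.dense([1.0, 0.0]), 1.0), (Vectors.dense([0.9, 0.1]), 1.0),
        (Vectors.dense([0.8, 0.2]), 1.0), (Vectors.dense([0.7, 0.3]), 1.0)]
train = spark.createDataFrame(rows, ["features", "label"])

lr = LogisticRegression()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")   # AUC

# Cross-validation over a small regularization grid.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=2)
model = cv.fit(train)

predictions = model.transform(train)
print("AUC:", evaluator.evaluate(predictions))

# Confusion matrix via the RDD-based metrics API.
pred_and_label = predictions.select("prediction", "label").rdd.map(tuple)
print(MulticlassMetrics(pred_and_label).confusionMatrix())
```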

  3. Advanced Analytics and Data Science - (Able to provide development, operation, and monitoring flows for deployed machine learning models)

BigLake Notebook supports MLOps, covering the development flow, operation, and monitoring of deployed machine learning models, with components such as Spark, Jupyter, and Zeppelin. Monitoring includes:

  • A list of schedulers, tasks, and stages

  • A summary of RDD sizes and memory utilization

  • Environment information

  • Information on running executors

  4. Advanced Analytics and Data Science - (Supports deep learning-based data modeling)

Data Science → Machine Learning using Deep Learning

  • Apache Spark is the primary enabling platform for distributed deep learning, as it allows various deep learning frameworks to be embedded into the Spark framework in a secure end-to-end pipeline.

  5. Advanced Analytics and Data Science - (The AI/ML libraries/tools are easy to upgrade or integrate with the latest libraries/tools if needed)

Data Science → Update/Upgrade Library for AI/ML

  • Apache Spark MLlib allows library updates through the DataFrame-based API

  6. Advanced Analytics and Data Science - (Supports types of analytics such as path analytics and text analytics)

Data Science → ML Path and Text Analytics

  • Apache Spark is able to perform text analytics because it is a scalable, distributed computing platform, including:

    • Text Mining (Text Clustering, Data-Driven Topics)

    • Categorization (Tagging Unstructured Data into categories and sub-categories; hierarchy and taxonomy)

    • Entity Extraction (Extracting patterns such as phrases, addresses, product codes, phone numbers etc.)

    • Sentiment Analysis (Tagging positive, negative, neutral and other sentiment levels)

    • Deep Linguistics (Semantics. Understanding causality, purpose, time etc.)

  7. Advanced Analytics and Data Science - (Has the ability to perform descriptive analytics, predictive analytics, prescriptive analytics, path analytics, text analytics, and others)

Data Science → Descriptive, Predictive, Prescriptive Analytics, Path Analytics, Text Analytics, and Others

  • Apache Spark is able to perform analytics such as descriptive, predictive, prescriptive, path, and text analytics, among others


  8. Advanced Analytics and Data Science - (Supports interactive Data Scientist user sessions with a minimum of Python and R)

Data Science → Supports at minimum the Python and R languages

  • Apache Spark supports languages such as Java, Scala, Python, R, and SQL

  • Zeppelin supports Spark, Python, and SQL

  • Jupyter supports Spark, Python, R, and Scala


  9. Advanced Analytics and Data Science - (Has a graph analytics feature)

Data Science → Graph Analytics

  • GraphX:

    • Apache Spark API for graphs and graph parallel computing.

  10. Advanced Analytics and Data Science - (Supports streaming analytics like Spark Streaming)

Data Science → Spark Streaming

  • Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards. You can even apply Spark's machine learning and graph processing algorithms to data streams. A brief sketch follows below.
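A brief, hedged sketch of the windowed, high-level API described above, here using Structured Streaming over a socket source (the host and port are placeholders; Kafka or Kinesis sources follow the same pattern):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-window-sketch").getOrCreate()

# Placeholder text source emitting lines of words.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Classic map -> window -> count pipeline on a live stream.
words = lines.select(F.explode(F.split("value", " ")).alias("word"),
                     F.current_timestamp().alias("ts"))
counts = (words
          .withWatermark("ts", "1 minute")
          .groupBy(F.window("ts", "1 minute"), "word")
          .count())

query = counts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```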

ETL Integration

  1. ETL (Integration) - (Able to perform file merging and file splitting (sharding))

BigAction is capable of merging and splitting files with processors already available in BigAction.


  2. ETL (Integration) - (Data Virtualization)

Data Virtualization (Virtualize RDBMS data and files):

  • Integrates data from disparate sources without copying or moving data, giving users a single virtual layer spanning multiple applications, formats, and physical locations


  3. ETL (Integration) - (Scalable and high performance: able to scale out services to serve increasing data processing needs)

  • NiFi Scaling Out:
    NiFi is designed to scale out by clustering many nodes together. If a single node is provisioned and configured to handle hundreds of MB per second, then a modest cluster can be configured to handle GB per second. This brings up the interesting challenge of load balancing and failover between NiFi and the systems from which it gets data. Using asynchronous, queue-based protocols such as messaging services, Kafka, etc. can help. NiFi's 'site-to-site' feature is also very effective, as it is a protocol that allows NiFi and clients (including other NiFi clusters) to talk to each other, share information about loading, and exchange data on specific authorized ports.

  • NiFi Scaling Up/Down:
    NiFi is also designed to scale-up and down in a very flexible way. In terms of increasing throughput from the point of view of the NiFi framework, it is possible to increase the concurrent tasks on the processor under the Scheduling tab when configuring. This allows more processes to be executed concurrently, providing greater throughput. On the other side of the spectrum, you can scale-down NiFi perfectly to suit running on edge devices where a small footprint is desired due to limited hardware resources.


  4. ETL (Integration) - (Able to perform minimal ETL tasks such as data conversion, lookup, expression, joining records, splitting data, filtering, ranking, sorting and grouping)

To perform an effective ETL Task, the user must know the types of BigAction/NiFi processors required. These processors provide the ability to ingest data from multiple systems, route, transform, process, split, and aggregate data, and distribute data across multiple systems.

  • Allows you to perform data ingestion to pull data into NiFi, from multiple data sources and create flow files

  • It offers real-time control that helps you manage the movement of data between any source & destination

  • Visualizing DataFlow at the enterprise level

  • Provides common tools and extensions

  • Lets you leverage existing libraries and functionality of the Java ecosystem

  • Helps organizations integrate NiFi with existing infrastructure

  • NiFi is designed to scale-out in groups offering guaranteed data delivery

  • Visualize and monitor performance and behavior through flow bulletins, offering insights and inline documentation

  • Helps you start and stop components individually or at the group level

  • Helps you listen, fetch, split, aggregate, route, transform, and build dataflows via drag & drop


  5. ETL (Integration) - (Supports Object Reusability in ETL development so that the same transformation process can be reused without the need for re-development)

  • A component can be reused (instantiated) as many times as needed within a flow (both inside and outside a process group).


  6. ETL (Integration) - (Has tools to develop modules or jobs and resume jobs when a failure occurs)

Components can be started by selecting all the components to start and then clicking the "Start" button in the Operate Palette, or by right-clicking a single component and selecting Start from the context menu.

When you start a Process Group, all components in that Process Group (including child Process Groups) are started, with the exception of components that are invalid or disabled.

Once started, the processor status indicator changes to a Play symbol.

  7. ETL (Integration) - (Has the ability to build comprehensive data lineage from the beginning of data collection to the aggregation stage)

  • NiFi maintains a very granular level of detail about every piece of data it absorbs. As data is processed through the system and transformed, routed, split, aggregated, and distributed to other endpoints, all of this information is stored in the NiFi Provenance Repository. To search and view this information, we can select Data Provenance from the Global Menu, which brings up a table listing the provenance events that have been recorded.

  • Inside Data Provenance there are Event Details and Lineage Graph features:

    • Event Details: Has 3 tabs namely Details, Attributes and Content

  • Details:
    Shows exactly when the event occurred, which FlowFile was affected, which component (processor, etc.) performed the event, how long the event lasted, and how long the data had been in NiFi when the event occurred (total latency).

  • Attributes:
    Shows all the attributes that were in the FlowFile when the event occurred, as well as the previous values of those attributes. This lets us see which attributes changed as a result of the event and how they changed.

  • Content:
    This tab provides information about the Content Repository and the stored FlowFile content. If an event changed the content of the FlowFile, you can see the 'before' (input) and 'after' (output) content claims. There is also an option to download the content or view it within NiFi itself, if the data format is compatible.

  • Lineage Graph: Lineage of FlowFile

  8. ETL (Integration) - (Integrated with Data Quality)

  9. ETL (Integration) - (Codeless, drag n drop pipeline designer)

  • Web-based user interface

    • Seamless experience across design, control, feedback, and monitoring

  • Highly configurable

    • Loss tolerant vs guaranteed delivery

    • Low latency vs high throughput

    • Dynamic prioritization

    • Flow can be modified at runtime

    • Back pressure

  • Data Provenance

    • Track data flow from start to finish

  • Designed for extension

    • Build your own processors and more

    • Enables fast development and effective testing

  • Secure

    • SSL, SSH, HTTPS, encrypted content, etc...

    • Multi-tenant authorization and internal authorization/policy management


  10. ETL (Integration) - (Data Security (Masking and Encryption))

  • Encrypts or decrypts a FlowFile, either using symmetric encryption with a raw key or password and a randomly generated salt, or using asymmetric encryption with a public and secret key.

  • Updates the contents of the FlowFile by evaluating the Regular Expression (regex) against it and replacing the portion of the content that matches the Regular Expression with some alternative value.

  11. ETL (Integration) - (Versioning and can be integrated with CI/CD)

  12. ETL (Integration) - (Monitoring and Alerting Facility for ETL jobs)

  • By using the Ambari Metrics and Grafana services, you have a way to easily and visually monitor NiFi performance. Apache Ranger can also be used to centralize authorization management for multiple components (NiFi, Kafka, etc.) in one place.

  • ETL Job Alert:

    • Apache NiFi provides several reporting tasks to support external monitoring systems such as Ambari, Grafana, etc. Developers can create custom reporting tasks or can configure built-in tasks to send NiFi metrics to external monitoring systems.

