Data management and analytics functions | Project & components | BIGBOX Data Platform |
---|---|---|
Distributed batch processing of large data sets | Apache Hadoop | |
Database for unstructured & structured data storage of large tables | Apache HBase +conn, +indx | |
Reliably store very large files across machines in a large cluster. | HDFS | |
Data warehouse summarization & ad hoc querying | Apache Hive | |
Metadata store for Hive tables | Hive Metastore (HMS) | |
Workflow scheduler to manage Hadoop jobs | Apache Oozie | |
Columnar storage format for Hadoop ecosystem | Apache Parquet | |
Fast compute engine for ETL, ML, DL, stream processing | Apache Spark | |
Integration capabilities with various types of DBMS including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, MongoDB, and DWH Microsoft APS (Analytics Platform System) | BigAction, Apache NiFi | |
Bulk data transfer between Hadoop and structured datastores (e.g., MySQL, Microsoft SQL Server, Oracle, PostgreSQL, MongoDB, and others) | Apache Sqoop | |
Job scheduling and cluster resource management | YARN | |
Coordination service for distributed applications | Apache ZooKeeper | |
Store and manage large data sets across a cluster | Apache Accumulo | |
Metadata management, data lineage, governance & data catalog | Apache Atlas | |
Migrate and replicate data to other ecosystems | Apache Ranger | |
Smallest, fastest columnar storage for Hadoop | Apache ORC | |
Data-flow framework for batch and interactive use cases | Apache Tez | |
Fast analytical queries on event-driven data | Apache Druid | |
Perimeter security governing access to Hadoop | Apache Knox | |
Cryptographic key management | Ranger KMS | |
Notebook for interactive analytics | Apache Zeppelin | |
Distributed processing for stateful computations | Apache Flink | |
Data serialization system | Apache Avro | |
Provisioning, managing, and monitoring Apache Hadoop clusters | Ambari | |
SQL workbench for data warehouses | Hue | |
Distributed MPP SQL query engine for Hadoop | Impala | |
Column-oriented data store for fast data analytics | Apache Kudu | |
Enterprise search & index platform | Apache Solr | |
Real-time streaming data pipelines, analytics, and apps | Apache Kafka, Spark Streaming, Storm, Flink | |
Distributed object store for Hadoop | Apache Ozone | |
Scalable directed graphs of data routing, transformation, and system mediation logic | BigAction, Apache NiFi | |
Interactive collaborating and sharing analytics tools | Jupyter | |
Manage resource and data security across the Hadoop ecosystem | Apache Ranger | |
Integrate with various formats, data sources, protocols, and other environments | BigAction, Apache NiFi | |
Strong authentication for client/server applications | Kerberos | |
Keeping track of the running applications | Ambari | |
Distributed realtime computation system | Apache Storm | |
Data scraping and crawling engine | BigSpider | |
Faceted search engine | BigSearch | |
The Enterprise Business Intelligence for All Your Needs | BigQuery | |
Powerful Business Dashboard Software for Everyone | BigBuilder | |
API Management platform | BigEnvelope | |
Build model evaluations (e.g., confusion matrix, cross-validation, AUC) | Apache Spark, Jupyter, Zeppelin | |
Run Change Data Capture (CDC) | BigAction, Apache NiFi | |
Data migration and replication between Data Centers | Apache Hadoop | |
Interoperable with other big data ecosystems | BigAction, Apache NiFi | |
Capability to scale out or apply patches, with auto-configuration features, without causing service downtime or impact on the ecosystem | HDFS | |
Capability to develop modules or jobs, and to resume jobs when a failure occurs | YARN | |
Provides a data security module including authentication, access authorization, audit trail, data masking, encryption and others | Ranger, Knox, Kerberos | |
Data security management not only on data at rest, but also on data in motion | Ranger, Knox, Kerberos | |
Object storage capability with support for the S3 protocol | Apache Ozone | |
Build a comprehensive data lineage from the start of data collection to the aggregation stage | Apache Atlas | |
Integrated capabilities with LDAP or Active Directory authentication services | Kerberos | |
Manage the allocation of computing resources based on user or group (e.g., CPU, storage, memory requirements) | YARN | |
Easy-to-use ETL tools with a GUI workflow designer | BigAction, Apache NiFi | |
Ability to keep running when one of the nodes is not functioning (Fault Tolerance) | HDFS | |
Scale-out capabilities to serve increasing data processing needs | BigAction, Apache NiFi | |
ETL supports parallel processing with big data frameworks including MapReduce, Spark, Storm, Flink, Tez, and others | BigAction, Apache NiFi | |
Supports object reusability in ETL development | BigAction, Apache NiFi | |
Unlimited number of ETL users | BigAction, Apache NiFi | |
Supporting data analytics, collaborative development & modelling with Machine Learning and Deep Learning with Python, R programming and others | Spark, Jupyter, Zeppelin | |
Analytics platform that can be integrated with big data clusters in the context of authentication, authorization, and resource management | Spark, Ranger | |
Sharing, publishing, and collaborating on data analytics projects | Spark, Jupyter, Zeppelin | |
Supports parallel processing of big data frameworks including MapReduce, Spark, Jupyter, Zeppelin | MapReduce, Jupyter, Zeppelin, Spark | |
Data analysis, model training, model deployment in the form of APIs, and collaboration facilities | Spark, Jupyter, Zeppelin | |
Capability to monitor the machine learning models that have been deployed | Spark, Jupyter, Zeppelin | |
Unlimited number of analytics platform users | Spark, Zeppelin, Jupyter | |
The analytics platform can interact with Hive, HDFS, and Solr data sources | Hive, HDFS, Solr, Spark, Jupyter, Zeppelin | |
The analytics platform can integrate with Oozie, YARN and HDFS Browser | Oozie, YARN, HDFS Browser, Spark, Jupyter, Zeppelin | |
Wizard features to configure landing zone, integration zone, analytics zone and others | BigLake | |
Capability to export data from Hive/Impala/Kudu to a target database | BigAction | |
Process data in both batch and real-time/streaming modes (publish-subscribe messaging) | BigAction, Kafka | |
Multidimensional data storage capability based on OLAP | Apache Druid | |
Ability to process graph data structure and run graph analytics | BigLake, Giraph | |
Full-text search capability across many database options | Apache Solr, Elasticsearch | |
ACID compliant (insert-update-select process against one table simultaneously) | BigLake | |
Distributed storage for real time analytics | BigLake | |
Retrieval and utilization of SQL and NoSQL based data | BigLake | |
Data lineage tracking, both within the datalake and from the data source | BigLake | |
AI/Machine Learning is easy to upgrade or integrate with other libraries/tools | BigLake | |
Interactions with multiple data sources including Spark and Impala | BigBuilder | |
Capability to build analytical data models (e.g., descriptive analytics, predictive analytics, prescriptive analytics, path analytics, text analytics, and others) | Spark, Jupyter, Zeppelin | |
Perform MLOps, including development flow, operations, and monitoring of deployed machine learning models | Spark, Jupyter, Zeppelin | |
Graph data processing and analysis can run on the installed platform with separately sized virtual nodes, without reducing the minimum usable capacity | BigLake, Giraph | |
Read and write access for ETL tools (such as InnoQuartz) and data integration solutions (such as Talend) | BigLake, HDFS | |
Easy self-service analytics tools to upload, explore, and analyze data, and to create visualizations with charts or dashboards | BigQuery, BigBuilder | |
Able to extract and transform data using different methods in an integrated platform, without writing a line of code/script | BigAction | |
Supports correlation of data from different data sources to generate new data sets, without ever writing a line of code/script | BigAction, BigBuilder | |
Supports data aggregation schemes (calculation, summation, average, maximum and minimum values, and others) without writing a line of code/script | BigAction, BigQuery, BigBuilder | |
Supports execution of one query set in one operation to generate several new data sets, without writing a line of code/script (Single Query Multiple Result) | BigAction, BigBuilder, BigEnvelope | |
Artificial Intelligence (AI) text processing using Indonesian-based Natural Language Processing | BigLake, BigSearch | |
Internet crawler and media analytics support in one system; can perform sentiment analysis and determine keywords from the information to be retrieved, with no restriction on the number of keywords | BigSpider | |
Supports exporting data to other systems via API. The resulting charts and dashboards can be accessed from a web browser using the generated URL. | BigEnvelope, BigQuery, BigBuilder | |
Can process data aggregation at 1 MB/sec per CPU thread, and can optimize queries so that CPU usage reaches more than 90% | BigLake | |
Support data processing & storage components with “Shard and Replica” and high availability | BigLake, HDFS | |
Distributed parallel processing, with support for multi-node processing that can be configured as a masterless cluster | BigLake | |
Can process data in gzip format (compressed data) without ever writing a line of code/script | BigLake | |
Support data virtualization schema (queried directly at the origin) | BigBuilder |
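
To illustrate the model-evaluation row in the table (confusion matrix, AUC), here is a minimal sketch in plain Python. The function names and the sample labels and scores are hypothetical and not part of the BIGBOX platform; on the platform such metrics would more likely be computed in Spark, Jupyter, or Zeppelin, as the table indicates.

```python
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical sample data: true labels and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Turn scores into hard predictions with an assumed threshold of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"AUC={auc(y_true, scores):.3f}")
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based computation would be the usual choice for large data sets.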