How we built a powerful retail analytics engine for 30 TB of big data churn
A leading big data and analytics firm in Pittsburgh, USA, was struggling with data churn across 30 TB of historical data and 17 billion transactional records. There was no method to the data madness: the lack of standardized processes such as de-duplication had led to inconsistent data, hampered data quality, and made it hard to handle noisy or missing data.
The company works on predictive analytics with partners across industries such as retail, media and entertainment, life sciences, and pharmaceuticals, and its business model is to help companies put their data to work in powerful new ways. The key requirement for delivering all this was accurate master data, which was missing.
The company, in the middle of an exciting growth phase, was concerned about how to take the organization to the next level of business performance. And so the search began for a reliable technology partner with proven experience in building digital products for big data analytics.
Word of our expertise in building digital products with emerging technologies reached the decision-makers, and an introductory call was set up. A team from the Accion Innovation Center (AIC) was engaged and presented several examples of Accion's digital transformation work for technology companies, showing how Accion helped them scale with innovative products that use emerging technologies. We tackled the problem head-on and proposed an agile analytics engine that was robust, accurate, and scalable.
The team started with the retail partners and addressed the problem of cleansing the data feeds coming from PoS devices. This data had to be made accurate before any analytics could be run on it. To illustrate the scale of the inconsistency, a single unit of measurement (UoM) such as Pack was written variously as PK, Pk, PCK, pck, and so on. The data had to be standardized and cleaned for many other UoMs with similar variations. We recommended ReactJS for the front end, and the back end was carefully engineered using best-in-class technologies suited to the client's type of business.
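To make the UoM standardization concrete, here is a minimal Python sketch of the idea. It is not the code from the project: the alias table, the canonical names, and the `normalize_uom` helper are hypothetical, and only the PK / Pk / PCK / pck variants of Pack come from the example above.

```python
import re

# Hypothetical alias table. Only the "Pack" variants (PK, Pk, PCK, pck) come
# from the article; the other entries are illustrative assumptions.
UOM_ALIASES = {
    "PACK": {"PK", "PCK", "PACK"},
    "EACH": {"EA", "EACH"},  # assumed example
    "CASE": {"CS", "CASE"},  # assumed example
}

# Invert the table once so each lookup is a single dictionary access.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in UOM_ALIASES.items()
    for alias in aliases
}


def normalize_uom(raw):
    """Return the canonical UoM for a raw PoS value, or None if unknown."""
    if raw is None:
        return None
    # Drop punctuation, digits and whitespace, then ignore case.
    key = re.sub(r"[^A-Za-z]", "", raw).upper()
    return ALIAS_TO_CANONICAL.get(key)


if __name__ == "__main__":
    for value in ["PK", "Pk", "PCK", "pck", " pk. ", "bottle"]:
        print(repr(value), "->", normalize_uom(value))
```

Values the table does not recognize come back as `None` rather than being guessed, so they can be routed to manual review instead of silently polluting the master data.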
A recommendation algorithm and a single-page application (SPA) were built, with a UI that talks to the back-end services through RESTful APIs. Three modules were designed and delivered. They included a 'Data Cleansing Service', packaged as a Spring Boot uber-jar, which contains the processing logic to recommend the UoM and exposes the RESTful APIs. The second module was a 'Data Cleansing UI', a ReactJS SPA that allowed logged-in users to create search criteria, search item records, see their details, and manually update the recommendations.
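The article does not describe how the recommendation algorithm works, so the following Python sketch only illustrates one plausible approach: ranking canonical UoMs by string similarity so that a reviewer can accept or override the suggestion. The canonical list, the `recommend_uom` function, and the score threshold are all assumptions, and the real service was a Spring Boot application rather than Python.

```python
from difflib import SequenceMatcher

# Assumed canonical vocabulary; the real master list is not given in the article.
CANONICAL_UOMS = ["PACK", "EACH", "CASE", "BOX"]


def recommend_uom(raw, limit=3, min_score=0.5):
    """Rank canonical UoMs by similarity to a raw value (hypothetical helper).

    Returns at most `limit` suggestions, best first, each with a score in
    [0, 1] so a UI could show it next to the item record.
    """
    key = raw.strip().upper()
    scored = [
        (SequenceMatcher(None, key, candidate).ratio(), candidate)
        for candidate in CANONICAL_UOMS
    ]
    # Keep plausible matches only; sort by score (descending), then by name.
    ranked = sorted(
        (item for item in scored if item[0] >= min_score),
        key=lambda item: (-item[0], item[1]),
    )
    return [{"uom": name, "score": round(score, 2)} for score, name in ranked[:limit]]


if __name__ == "__main__":
    print(recommend_uom("pck"))
    print(recommend_uom("zzz"))
```

Returning a ranked list with scores, rather than a single answer, fits the workflow described above in which users can review and manually update the recommendations.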
One of the major challenges Accion addressed was the failure of Extract, Transform, Load (ETL) jobs on the Hadoop/YARN and Spark clusters. We identified that the linear (sequential) execution of analytical queries was blocking other jobs and was also causing other, previously unidentified failures. Batch machine learning pipelines were taking a long time to build cohorts, produce KPI reports, and run campaigns.
The project then involved re-engineering and optimizing the pipeline from the data lake to the analytical engine, along with the data warehouse processing pipelines. The streaming and near-real-time data pipeline was made highly concurrent using a parallel execution paradigm. We also reduced manual intervention through a CI/CD initiative.
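The difference between linear and concurrent execution can be shown with a small Python sketch. This is only a language-level illustration under stated assumptions: the production system ran on Hadoop/YARN and Spark, and the job names, durations, and the `run_job` and `run_pipeline` helpers below are hypothetical.

```python
import asyncio


async def run_job(name, seconds, limiter):
    """Simulate one independent pipeline job (hypothetical stand-in)."""
    async with limiter:  # cap how many jobs run at the same time
        await asyncio.sleep(seconds)  # stands in for real non-blocking I/O
        return f"{name}: done"


async def run_pipeline(jobs, max_concurrent=4):
    """Run independent jobs concurrently instead of one after another."""
    limiter = asyncio.Semaphore(max_concurrent)
    tasks = [run_job(name, seconds, limiter) for name, seconds in jobs]
    # gather() returns results in the same order as the submitted jobs.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # Assumed job names, loosely based on the workloads mentioned in the article.
    jobs = [("cohorts", 0.2), ("kpi_reports", 0.2), ("campaigns", 0.2)]
    print(asyncio.run(run_pipeline(jobs)))
```

Run one after another, these three jobs would take about the sum of their durations; run concurrently, they take roughly as long as the slowest one, and a slow job no longer blocks the others. The semaphore keeps the concurrency bounded so that a large batch cannot exhaust shared resources.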
Accion's engagement helped this tech company build a next-gen analytical platform that delivers better business outcomes, with insightful data based on accurate predictive modeling. Re-engineering the entire pipeline from the data lake to the analytical engine delivered a 5X performance improvement. In addition, the highly concurrent, elastic, non-blocking, and asynchronous architecture we designed reduced runtime from 30 hours to ~8 hours (a saving of ~22 hours, or over 70%) while processing 4.6 billion events.
Need help tackling your big data problems? Contact us.