
KGLiDS: Semantic Abstraction and Automation Platform for Data Science


Core Concepts
KGLiDS is a scalable platform that abstracts data science artifacts, captures semantics, and enables automation for data discovery, cleaning, transformation, and AutoML.
Abstract
KGLiDS addresses the lack of systematic knowledge sharing in data science. The platform employs machine learning and knowledge graph technologies. It abstracts pipelines and datasets to capture their semantics efficiently. KGLiDS offers on-demand automation for data cleaning and transformation using GNN models. The LiDS graph construction interlinks datasets with pipeline graphs. Interfaces provide pre-defined operations for users to interact with the system effectively.
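The idea of interlinking datasets with pipeline graphs can be pictured with a toy set of subject-predicate-object triples. This is only a minimal sketch: the identifiers (`pipeline:p1`, `dataset:titanic/train.csv`) and the predicate names (`reads`, `calls`, `hasColumn`) are hypothetical and are not taken from the LiDS ontology or the KGLiDS interfaces.

```python
from collections import defaultdict

# Hypothetical triples; names and predicates are illustrative only,
# not the actual LiDS graph schema.
triples = [
    ("pipeline:p1", "reads", "dataset:titanic/train.csv"),
    ("pipeline:p1", "calls", "pandas.DataFrame.dropna"),
    ("pipeline:p2", "reads", "dataset:titanic/train.csv"),
    ("pipeline:p2", "calls", "sklearn.impute.SimpleImputer"),
    ("dataset:titanic/train.csv", "hasColumn", "column:Age"),
]


def build_index(triples):
    """Map each (predicate, object) pair to the subjects that point at it."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].append(subject)
    return index


def library_calls_for_dataset(triples, dataset):
    """List library calls made by any pipeline that reads the given dataset."""
    pipelines = set(build_index(triples)[("reads", dataset)])
    return sorted(
        obj
        for subject, predicate, obj in triples
        if predicate == "calls" and subject in pipelines
    )


result = library_calls_for_dataset(triples, "dataset:titanic/train.csv")
print(result)
```

Because the dataset node is shared by both pipelines, one lookup surfaces how different authors handled the same data, which is the kind of question a linked graph makes cheap to ask.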
Stats
KGLiDS demonstrates significantly faster performance with lower memory usage compared to existing systems while maintaining accuracy.
Quotes
"Data scientists primarily work in isolation without exchanging knowledge."
"KGLiDS combines dataset search and pipeline generation within a single framework."
"KGLiDS enables automatic learning and discovery on open data science."

Key Insights Distilled From

by Mossad Helal... at arxiv.org 03-25-2024

https://arxiv.org/pdf/2303.02204.pdf
KGLiDS

Deeper Inquiries

How can KGLiDS improve collaboration among data scientists?

KGLiDS can improve collaboration by giving data scientists a shared platform for the implicit knowledge and experience embedded in data science artifacts. It uses machine learning and knowledge graph technologies to abstract the semantics of datasets, pipelines, and programming libraries, so data scientists can discover relevant work by colleagues instead of reinventing the wheel. This reduces duplicated effort, promotes best practices, and lets practitioners learn from each other's experience.

What are the limitations of static code analysis in capturing pipeline semantics?

Static code analysis cannot capture every aspect of a pipeline's behavior because it inspects the code without executing it. It is therefore less accurate for dynamic languages such as Python, which most data science pipelines use. Details such as the return types or parameter names of library calls are hard to infer accurately from the source text alone. Static analysis may also struggle with complex control flow, or with interactions between parts of a pipeline that become evident only at runtime.
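The return-type problem can be illustrated with Python's standard `ast` module. This is a minimal sketch under stated assumptions, not the analysis KGLiDS performs: the pipeline snippet and the `extract_calls` helper are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Hypothetical pipeline snippet, used only for illustration.
# It is parsed, never executed, so pandas does not need to be installed.
PIPELINE_SOURCE = """
import pandas as pd
df = pd.read_csv("train.csv")
clean = df.dropna()
"""


def extract_calls(source):
    """Return the dotted name of every call in the source, without running it."""
    tree = ast.parse(source)
    return [
        ast.unparse(node.func)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
    ]


calls = extract_calls(PIPELINE_SOURCE)
print(calls)
```

The syntax tree yields `pd.read_csv` and `df.dropna`, but nothing in the source text says that `df` is a DataFrame. Resolving `df.dropna` to a specific library method requires knowing what `pd.read_csv` returns, and the positional argument `"train.csv"` carries no parameter name. Both pieces of information have to come from somewhere other than the parsed code.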

How can KGLiDS contribute to advancing research beyond traditional data science practices?

KGLiDS can advance research beyond traditional data science practices by introducing a paradigm centered on semantic abstraction, linking, and automation of data science processes. It uses machine learning and knowledge graphs to capture the semantics of datasets and pipelines at scale, which makes data discovery, cleaning, and transformation more efficient through automated recommendations learned from existing artifacts. Drawing on the implicit knowledge in those artifacts lets researchers experiment with new ideas faster while building on the best practices already captured in the system's knowledge base.