Laboratory of Soft Skill 1
Workshop in Master Program, Université Claude Bernard Lyon 1, Math Department, 2024
Available projects:
Survey on French (Large) Language Models in Healthcare
For this assignment, you are tasked with conducting a detailed survey of the current state of the art in language models designed for processing French in the healthcare domain. Explore a range of models, from large-scale general-purpose models like GPT or BLOOM to more specialized architectures such as BERT variants (e.g., CamemBERT or FlauBERT) and domain-specific adaptations (e.g., BioBERT or models fine-tuned on medical corpora). Investigate how these models perform on healthcare-related tasks such as medical text classification, named entity recognition (NER) for clinical terms, or question-answering for patient data. Discuss the datasets used for training or fine-tuning these models, evaluation benchmarks, and how well these tools address challenges in healthcare contexts, such as handling sensitive data or specialized terminology. Your report should identify current gaps and propose potential future directions for research.
Survey on Data Quality for Graphs
In this assignment, you are required to conduct a comprehensive survey on the current state of the art in data quality methodologies for graph data. Your research should focus on techniques for ensuring consistency (e.g., constraints like functional dependencies, path constraints) and methods for repairing inconsistent or incomplete graph data. Consider the application of these methodologies across different graph data models, including labeled graphs, property graphs, and RDF (Resource Description Framework). Highlight the unique challenges and solutions associated with each model, such as handling schema heterogeneity in RDF or edge-labeling issues in property graphs. Discuss evaluation metrics, tools, and frameworks used to assess and improve data quality in graph-based systems, and identify potential research gaps or emerging trends in this domain.
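To make the notion of a consistency constraint concrete, here is a minimal sketch of a functional-dependency-style check on a property graph: nodes that share a label and agree on a set of determinant properties must also agree on a dependent property. The graph encoding (a list of dictionaries), the function name `find_fd_violations`, and the sample data are assumptions made for illustration only; they do not come from any particular graph database or paper.

```python
from collections import defaultdict

def find_fd_violations(nodes, label, lhs, rhs):
    """Return groups of node ids that violate the rule lhs -> rhs for a label.

    nodes: iterable of dicts with keys "id", "label", "props" (hypothetical encoding).
    lhs:   tuple of property names that should determine rhs.
    rhs:   name of the dependent property.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for node in nodes:
        if node["label"] != label:
            continue
        props = node["props"]
        # Skip nodes lacking a determinant property: the rule does not apply to them.
        if any(name not in props for name in lhs):
            continue
        key = tuple(props[name] for name in lhs)
        # A missing rhs is grouped under None, so incompleteness also shows up here.
        groups[key][props.get(rhs)].append(node["id"])
    # A key is violated when it maps to more than one distinct rhs value.
    return {key: dict(by_value) for key, by_value in groups.items() if len(by_value) > 1}

nodes = [
    {"id": 1, "label": "City", "props": {"zip": "69100", "name": "Villeurbanne"}},
    {"id": 2, "label": "City", "props": {"zip": "69100", "name": "Lyon"}},
    {"id": 3, "label": "City", "props": {"zip": "75001", "name": "Paris"}},
    {"id": 4, "label": "Person", "props": {"zip": "69100", "name": "Alice"}},
]
violations = find_fd_violations(nodes, "City", ("zip",), "name")
print(violations)
```

Detecting a violation is only the first step. Choosing which of the conflicting values to keep is the repair problem that the survey is meant to cover.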
Investigate the Current Status of the Resource Track at ISWC
For this assignment, you are tasked with investigating the current status of works published in the ISWC (International Semantic Web Conference) Resource Track. Your survey should focus on assessing whether the resources (e.g., datasets, ontologies, tools, benchmarks) are still available online, their level of maintenance or updates, and their current usage in research or industry. Analyze metrics such as citations, downloads, or integration into other projects to gauge their impact. Additionally, explore the factors that contributed to the success or lack of adoption of these resources, such as usability, documentation quality, community engagement, or alignment with evolving technological trends. Your report should provide insights into best practices for ensuring the longevity and relevance of resources in the Semantic Web community.
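One possible starting point for the availability check is a small script that probes each resource URL and records a coarse status. The sketch below uses only the Python standard library. The function names `check_url` and `classify`, the status categories, and the `opener` parameter (added so the check can be tested without network access) are illustrative assumptions, not a prescribed method.

```python
import urllib.error
import urllib.request

def check_url(url, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if the host cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code
    except (urllib.error.URLError, TimeoutError):
        # DNS failure, refused connection, or timeout.
        return None

def classify(status):
    """Map a status code to a coarse availability category (hypothetical scheme)."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "available"
    if status in (404, 410):
        return "gone"
    return "error"
```

Calling `classify(check_url(url))` for each resource URL gives a first-pass table of availability. Some servers reject `HEAD` requests, and a page that returns 200 may still be a parked domain, so results should be spot-checked by hand.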
Survey on Causal Inference Analysis in Data Management
For this assignment, you are tasked with conducting a survey on the current state of the art in causal analysis within the field of data management. Focus on how causal analysis is utilized, such as for explainability, decision support, or predictive modeling, and evaluate the extent to which current research enables causal analysis in data management systems. Investigate methodologies, frameworks, and tools that integrate causal inference into traditional data management workflows. Discuss challenges in scaling causal techniques, ensuring accuracy, and integrating them with existing systems, as well as the practical impact of these advancements on real-world applications. Identify gaps in research and propose potential directions for making causal analysis a more integral part of modern data management.
Survey of LLMs in Data Management
For this assignment, you are tasked with conducting a survey on the current state of the art in using large language models (LLMs) for data management tasks. Explore how LLMs, such as GPT, Codex, or specialized models, are applied in areas like data cleaning, schema mapping, query generation, natural language-to-SQL translation, metadata extraction, and semantic enrichment. Evaluate the strengths and limitations of these models, including their scalability, accuracy, and adaptability to diverse data management scenarios. Discuss benchmarks and datasets used for evaluation, as well as the challenges of integrating LLMs into existing systems. Your report should highlight key advancements, practical use cases, and open research questions in this emerging field.
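As an illustration of how natural language-to-SQL output is often scored, the sketch below computes a simple execution-based accuracy: a predicted query counts as correct when it returns the same rows as the reference query on a test database. The table, the queries, and the helper name `execution_match` are made up for this example; real benchmarks define their own databases and matching rules.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same multiset of rows (row order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid predicted query counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)

# Toy database standing in for a benchmark schema (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 34), (2, 71), (3, 65);
""")

# (predicted query, reference query) pairs, e.g. model output vs. ground truth.
pairs = [
    ("SELECT id FROM orders WHERE amount >= 65", "SELECT id FROM orders WHERE amount > 64"),
    ("SELECT COUNT(*) FROM order_items", "SELECT COUNT(*) FROM orders"),  # unknown table
]
accuracy = sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
print(accuracy)
```

Execution matching accepts queries that are written differently but equivalent on the test data. It can also accept a wrong query that happens to return the right rows, which is one of the evaluation pitfalls worth discussing in the survey.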
Survey on Data Analysis in E-Sport
For this assignment, you are tasked with conducting a survey on the current state of the art in data science and machine learning technologies used in e-sports. Explore how these technologies are applied to areas such as player performance analysis, game strategy optimization, matchmaking systems, cheat detection, and audience engagement. Investigate specific machine learning models, data analytics tools, and frameworks used in the e-sports industry, highlighting their impact on competitive gaming and community experiences. Discuss challenges such as real-time data processing, fairness in matchmaking, and the interpretability of AI models in gaming contexts. Your report should provide insights into recent advancements, key use cases, and opportunities for further innovation in e-sports powered by data science and machine learning.
Survey on Data Analysis in Sports
For this assignment, you are tasked with conducting a survey on the current state of the art in data science and machine learning technologies applied in sports. Explore how these technologies are used for performance analytics, injury prediction and prevention, team strategy optimization, fan engagement, and talent scouting. Investigate the use of specific machine learning models, wearable technology data, computer vision systems, and advanced analytics platforms in various sports. Discuss the challenges faced in real-time data processing, model accuracy, and ethical considerations like data privacy. Your report should highlight recent advancements, key applications, and potential future directions for data science and machine learning in the sports industry.
Evaluate LLM Performance in Reasoning over Structured Data
For this assignment, you are tasked with evaluating the performance of different large language models (LLMs) in reasoning over structured (tabular) data. You will design an experiment by selecting a set of LLMs to compare, defining the type of reasoning tasks to test (e.g., aggregation, filtering, join reasoning, or trend analysis), and identifying or creating tabular datasets for evaluation. Choose appropriate metrics to assess performance, such as accuracy, reasoning completeness, response time, or interpretability. Clearly document the experiment setting, including the prompts used, the evaluation criteria, and any preprocessing steps. Your report should analyze the strengths and weaknesses of each LLM in handling structured data and provide insights into their potential and limitations for tabular reasoning tasks.
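For the structured-data reasoning experiment, one simple way to report accuracy is to break it down by model and by reasoning task type. The sketch below assumes that each model answer has already been collected as a string. The record layout, the model names, and the normalization rules are assumptions chosen for illustration and would need to be adapted to the actual prompts and datasets.

```python
from collections import defaultdict

def normalize(answer):
    """Lowercase, trim, and canonicalize numeric strings so "42.0" matches "42"."""
    text = str(answer).strip().lower()
    try:
        # Drop thousands separators before parsing, e.g. "1,200" -> 1200.0.
        return repr(float(text.replace(",", "")))
    except ValueError:
        return text

def accuracy_by_task(records):
    """records: iterable of (model, task_type, predicted, expected) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, task, predicted, expected in records:
        totals[(model, task)] += 1
        hits[(model, task)] += normalize(predicted) == normalize(expected)
    return {key: hits[key] / totals[key] for key in totals}

# Hypothetical results: model names and answers are placeholders.
records = [
    ("model_a", "aggregation", "42.0", "42"),
    ("model_a", "aggregation", "17", "18"),
    ("model_a", "filtering", " Lyon ", "lyon"),
    ("model_b", "aggregation", "1,200", "1200"),
]
scores = accuracy_by_task(records)
print(scores)
```

Exact matching after normalization is strict. Free-form answers may need a more tolerant comparison, and whichever rule is chosen should be documented as part of the evaluation criteria.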
Surveying IoT Applications for Environmental Monitoring
For this assignment, you are tasked with conducting a survey on the current state of the art in IoT (Internet of Things) applications for environmental monitoring. Investigate the types of sensors commonly used (e.g., air quality sensors, water quality sensors, temperature and humidity sensors) and the variety of data they collect. Explore how this data is analyzed, including techniques for real-time monitoring, predictive modeling, and anomaly detection, as well as how it contributes to addressing environmental challenges such as pollution control, climate change, and natural resource management. Your report should also discuss the integration of IoT systems with advanced technologies like machine learning and evaluate the challenges and limitations faced in this domain, such as energy efficiency, data privacy, and scalability.
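As a small example of the anomaly detection techniques mentioned above, the sketch below flags sensor readings that deviate strongly from a rolling window of recent values (a rolling z-score). The window size, the threshold, and the sample PM2.5 readings are illustrative assumptions. Deployed systems typically use more robust or learned detectors.

```python
from collections import deque
from statistics import mean, stdev

def rolling_zscore_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that lie more than `threshold` standard
    deviations away from the mean of the preceding `window` normal readings."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
                continue  # keep the outlier out of the reference window
        history.append(value)
    return anomalies

# Hypothetical PM2.5 readings with one spike at index 6.
pm25 = [12.0, 13.1, 11.8, 12.5, 12.9, 13.0, 55.0, 12.7, 12.2]
anomalies = rolling_zscore_anomalies(pm25)
print(anomalies)
```

Because flagged values are excluded from the window, a single spike does not inflate the baseline. The trade-off is that a genuine, lasting shift in the signal keeps being reported as anomalous until the logic is extended to handle it.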
Surveying Crowdsensing for Environmental Applications
For this assignment, you are tasked with conducting a survey on the current state of the art in crowdsensing for environmental applications. Investigate how crowdsensing—where data is collected by individuals or devices in a distributed manner—is being used to monitor and address environmental challenges. Explore the types of data collected (e.g., air quality, noise levels, water pollution) and the technologies employed, such as mobile devices, wearable sensors, and IoT-enabled systems. Examine the methods used for data aggregation, validation, and analysis, as well as the role of community engagement and incentivization in ensuring data quality and participation. Your report should also discuss challenges such as data privacy, scalability, and reliability, while identifying successful use cases and potential areas for future research.
Evaluating LLM Performance in Converting PDF Content to Table Format
For this assignment, you are tasked with evaluating the performance of large language models (LLMs) in converting unstructured content from PDFs into structured table formats. Select a diverse set of PDFs containing text-based information that can logically be organized into tables, such as financial reports, schedules, or research summaries. Define the evaluation criteria, including the accuracy of data extraction, the preservation of relationships between elements, and the correctness of the generated table structure. Analyze how different LLMs handle various challenges, such as complex layouts, multi-column text, and implicit data hierarchies. Your report should document the experiment setup, highlight the strengths and limitations of the LLMs tested, and provide recommendations for improving their performance in such tasks.
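One way to make the accuracy criterion measurable is to compare the extracted table with a hand-made reference table cell by cell. The sketch below computes precision, recall, and F1 over (row key, column name, value) triples, which also penalizes values placed in the wrong row or column. The table encoding, the use of the first column as row key, and the sample data are assumptions for illustration.

```python
from collections import Counter

def table_cells(header, rows):
    """Flatten a table into (row key, column name, value) triples.

    The first column is assumed to identify the row.
    """
    cells = Counter()
    for row in rows:
        row_key = row[0].strip().lower()
        for column, value in zip(header[1:], row[1:]):
            cells[(row_key, column.strip().lower(), value.strip().lower())] += 1
    return cells

def cell_scores(predicted, gold):
    """predicted, gold: (header, rows) pairs. Returns (precision, recall, f1)."""
    pred_cells = table_cells(*predicted)
    gold_cells = table_cells(*gold)
    # Counter intersection keeps the minimum count of each shared triple.
    overlap = sum((pred_cells & gold_cells).values())
    precision = overlap / sum(pred_cells.values()) if pred_cells else 0.0
    recall = overlap / sum(gold_cells.values()) if gold_cells else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Hypothetical reference table and LLM output with one wrong cell.
gold = (["Quarter", "Revenue", "Profit"], [["Q1", "100", "10"], ["Q2", "120", "15"]])
predicted = (["Quarter", "Revenue", "Profit"], [["Q1", "100", "10"], ["Q2", "120", "51"]])
precision, recall, f1 = cell_scores(predicted, gold)
print(precision, recall, f1)
```

This metric assumes row keys are unique and correctly extracted. Tables with merged cells or nested headers would need a richer representation before scoring.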
