Chapter 4

Leveraging Open Source Tools for Analytics in Education Research

Sindhura Elluri*
Department of Computer and Information Technology, Knoy 372, Purdue University, West Lafayette, Indiana 47906, United States
*E-mail: [email protected].

Computer-based assessment has gained importance in the last several years, and the need to better identify the gaps and factors affecting educational processes and settings has become vital. This chapter discusses the growing importance of data analytics in education research for identifying the parameters affecting students' academic performance and conceptual understanding. It also gives an overview of the general procedure followed in conducting data analysis and of the different open source tools available for both quantitative and qualitative research.

Introduction

Educational research is a cyclical process of steps that typically begins with identifying a research problem or issue of study. It then involves reviewing the literature, specifying a purpose for the study, collecting and analyzing data, and forming an interpretation of the information. The process culminates in a report, disseminated to audiences, that is evaluated and used in the educational community (1). The basic educational research process thus involves identifying the problem to be addressed, selecting methods for collecting the data required for analysis, analyzing those data, making inferences from them to identify the theory bearing on the problem, and suggesting measures to address the gap.

Data handling is a vital part of educational research. It comprises inspecting, cleansing, transforming, and modeling data so that the data can be properly analyzed, with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Traditional analysis techniques in educational research include qualitative, quantitative, and statistical analyses; more recent approaches at the intersection of machine learning and education include learning analytics and educational data mining.

Quantitative/statistical analysis refers to the numerical representation and manipulation of observations for the purpose of describing and explaining the phenomena that those observations reflect. It involves techniques by which researchers convert data to numerical format and subject them to statistical analysis to test research hypotheses. Qualitative analysis refers to the development of concepts that help us understand social phenomena in natural settings, with emphasis on the meanings, experiences, and views of the participants.

With the increase in the size and complexity of the data to be analyzed, researchers are increasingly interested in automated methods for discovering patterns in data. A confluence of advances in the computer and mathematical sciences has unleashed an unprecedented capability for conducting data-intensive research. Data mining, machine learning algorithms, and natural language processing techniques are slowly being adopted in education research. Modern machine learning and natural language processing techniques can analyze data to identify how students learn, what students know, and, furthermore, what they do not know.
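As a minimal illustration of the quantitative workflow described above, the Python sketch below converts Likert-style survey responses to numerical scores and runs an independent-samples t-test with SciPy. The responses and group names are hypothetical, invented only for this example:

    # A minimal sketch of quantitative analysis: convert qualitative
    # responses to numbers, then test whether two groups differ.
    from scipy import stats

    # Hypothetical survey responses from two course sections.
    scale = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
             "agree": 4, "strongly agree": 5}
    section_a = ["agree", "strongly agree", "neutral", "agree", "strongly agree"]
    section_b = ["disagree", "neutral", "agree", "disagree", "neutral"]

    # Convert the qualitative responses to numerical scores.
    scores_a = [scale[r] for r in section_a]
    scores_b = [scale[r] for r in section_b]

    # Independent-samples t-test of the hypothesis that the sections differ.
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")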

Types of Data

Data collected for analysis in educational research comes in multiple formats, depending on how it is collected and stored. Data can be broadly classified as structured, semi-structured, or unstructured. Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable by simple, straightforward search algorithms or other search operations; it is generally stored in relational databases. Semi-structured data is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. XML and JSON are forms of semi-structured data. An example from education research would be recording students' actions in a JSON log as they interact with a learning tool. Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy but may also contain data such as dates, numbers, and facts. Text files and spreadsheets are forms of unstructured data; an example from education research is a think-aloud or interview transcript.

Most of the data collected in educational research is semi-structured or unstructured. The tools and methods used for analysis vary depending on the format of the data, the size of the data, how the data is stored, and what type of analysis is being used. Some open source data analysis tools, like R and Python, support any of these data formats. Data can also first be converted manually into numerical form by assigning scores or values, after which statistical analysis is performed to test the hypothesis.
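To make the semi-structured case concrete, the sketch below parses a hypothetical JSON log of student actions and counts actions per student; the record fields are illustrative and not from any particular learning tool:

    import json

    # Hypothetical JSON log records of students interacting with a learning tool.
    log_lines = [
        '{"student_id": "s01", "action": "open_simulation", "timestamp": "2017-03-01T10:02:11"}',
        '{"student_id": "s01", "action": "change_parameter", "timestamp": "2017-03-01T10:03:45"}',
        '{"student_id": "s02", "action": "open_simulation", "timestamp": "2017-03-01T10:02:30"}',
    ]

    # Parse each semi-structured record and count actions per student.
    counts = {}
    for line in log_lines:
        record = json.loads(line)
        counts[record["student_id"]] = counts.get(record["student_id"], 0) + 1

    print(counts)  # {'s01': 2, 's02': 1}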


Overview of Data Analysis Processes

Data analysis is a sequential process; Figure 1 depicts the general workflow. The process starts with identifying what questions need to be answered from the data and collecting the relevant data. Methods of data collection in education research are diverse: different researchers employ different techniques based on the issue they are trying to address, and some explicitly design their own instruments based on the hypothesis being tested. The most common forms of data collection include questionnaires, interviews, observations, records, tests, and surveys, among others.

The next step is to organize the collected data for analysis, which includes data cleaning and data integration. If data is collected from multiple sources, each with a different format, it must be integrated into a single data set with a uniform format to ease analysis. Data is not always perfect and can present multiple problems that need to be identified and corrected to ensure accurate results. The decision of which parts of the data to retain and which to eliminate depends on the research context. In statistical analysis, the data needs to be checked for duplicates, outliers, and missing values. In qualitative analysis of interview text, the data needs to be checked for mistyped words, and content irrelevant to the research must be identified and removed. In research that uses text analytics, individual words irrelevant to the research context or analysis are marked as stop words and removed from the data. An exploratory data analysis can provide additional insights, enabling researchers to make decisions regarding further data cleaning, preprocessing, and data modeling.

The next step is to identify an approach for data analysis, depending on the type of research, and the right software for the analysis. Quantitative data analysis involves classifying features and constructing statistical models to explain what is observed. Statistical methods are often used for quantitative data analysis; analyses are univariate, bivariate, or multivariate, depending on the number of variables used to answer the research question. Identifying the right statistical test is crucial. The statistical methods used, depending on the research context, include but are not limited to checking the distribution of the data, factor analysis, hypothesis testing, regression, t-tests, ANOVA, correlation, and cluster analysis. Qualitative data analysis has two different approaches: (a) the deductive approach, which uses research questions to group the data and then looks for similarities and differences, and (b) the inductive approach, which uses an emergent framework to group the data and then looks for relationships.
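As a hedged sketch of the cleaning step, the Python fragment below uses pandas on hypothetical pretest/posttest scores to remove duplicates, handle missing values, and flag implausible entries; the data, column names, and the 0-20 score range are all assumptions made for illustration:

    import pandas as pd

    # Hypothetical assessment scores with a duplicate row, a missing value,
    # and a suspicious posttest entry (scores are assumed to range 0-20).
    df = pd.DataFrame({
        "student_id": ["s01", "s02", "s02", "s03", "s04"],
        "pretest":    [14,    11,    11,    None,  12],
        "posttest":   [18,    15,    15,    17,    55],
    })

    df = df.drop_duplicates()            # remove the duplicated record for s02
    df = df.dropna(subset=["pretest"])   # drop rows with a missing pretest score

    # Flag impossible values given the assumed 0-20 score range (a domain check).
    print(df[df["posttest"] > 20])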


Figure 1. Summary of the general process of data analysis.



Qualitative analysis involves identifying recurrent themes and patterns in the data, clustering the data to identify related themes, and developing a hypothesis to test against the data. In traditional qualitative analysis the researcher does this work manually, which is labor-intensive. Software tools give researchers the flexibility to use data in any format, such as text, pictures, audio, and video. Various machine learning, data mining, and natural language processing techniques can help researchers identify themes and patterns in data and build predictive models, and software tools often have built-in libraries that can be leveraged to perform analysis with these algorithms. The results of an algorithm should be carefully investigated in order to answer the research questions. The summary of the data is often presented in the form of visualizations, which enable decision makers to grasp difficult concepts or identify new patterns in the data. The analysis cycle is iterative: based on the results, the cycle can be repeated with a different data set or a different data model to identify which gives better results, in order to justify the hypothesis or answer the research question in context.
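As a hedged illustration of machine-assisted theme discovery, the sketch below clusters short, invented interview excerpts with scikit-learn's TF-IDF vectorizer and k-means. A real qualitative study would involve far more data, and the resulting clusters would still require human review before being treated as themes:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical interview excerpts; a real study would have many more.
    excerpts = [
        "I liked working through the simulation before the lecture",
        "The simulation helped me see the concept visually",
        "The homework deadlines felt too close together",
        "I ran out of time on the weekly assignments",
    ]

    # Represent each excerpt as TF-IDF features, removing English stop words.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(excerpts)

    # Group the excerpts into two candidate themes for the researcher to inspect.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for text, label in zip(excerpts, labels):
        print(label, text)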

Choosing the Right Software for Analysis

The selection of a software tool for analysis often comes down to the researcher's personal inclination, yet there are multiple open source tools that could perform the intended analysis more efficiently. The parameters that govern the choice of software differ for quantitative and qualitative data analysis.

Qualitative Analysis

Software tools are used in qualitative data analysis to transcribe the data, to code/tag the data, to search and retrieve the data, and to identify patterns or relations in the data. As described in the book "Computer Programs for Qualitative Data Analysis" (2), the key questions to assess before choosing a particular software for data analysis are:

1) Type and size of the data
2) Theoretical approach to the analysis
3) Time required to learn the software and time required for the analysis
4) Depth of analysis required: simple or detailed
5) Desired quantification of the results
6) Software support available in case of any issues

Quantitative Analysis

Software tools are used in quantitative analysis for statistical modeling and data wrangling. Most of the parameters are similar to those described for qualitative analysis above. Some of the key factors to assess before choosing a particular software for analysis are:

1) Whether it allows the user to perform the intended statistical test
2) Size of the data
3) Time required to learn the software and time required for the analysis
4) Programming capability of the researcher, and the amount of programming the software requires to perform the task
5) Software support available in case of any issues


Open Source Tools

Open source refers to a program or software whose source code is available to the general public, free of charge, for use and/or modification from its original design. Open source software can be modified to incorporate any additional functionality required by its users. It provides an open platform for researchers to contribute ideas and to develop reusable packages that are useful for their own research and can also be used by other researchers who require that type of analysis. Many researchers in the domain of education have started using such software tools for their data analysis. Table 1 lists some of the open source software tools that can be used for quantitative/statistical research, qualitative research, and data visualization in education research.

Sample Case Study Using Apache Tools

With the increase in the size of the data to be analyzed, big data has become a ubiquitous term in learning analytics as well. Apache tools are very popular for big data analysis and are easy to learn and use. Apache Drill is one of many Apache tools; it offers flexible data transformation, querying, and visualization, and allows users to analyze data without having to define complex schemas or rebuild their entire data infrastructure. Drill is a boon for anyone who relies on SQL to make meaningful inferences from data sets. Another advantage of Drill is that it does not require a schema or type specification to start executing a query: Drill processes data in record batches and discovers the schema automatically during processing. Self-describing data formats such as Parquet, JSON, Avro, and NoSQL databases carry their schema as part of the data itself, which Drill leverages dynamically at query time.

Another distinctive feature of Apache Drill is Drill Explorer, a user interface for browsing Drill data sources, previewing the results of a SQL query, and creating a view and querying it as if it were a table. Drill Explorer helps you examine and understand the metadata in any format before querying or designing views, which can then be used to visualize the data in BI/analytics tools like Tableau. It allows the user to explore the structure, size, and content of data in any format.

Table 1. Open Source Software Tools

1) R: An active open source project with numerous packages available to perform any type of statistical modeling. Functionality supported: exploration, visualization, analysis (qualitative and quantitative), data wrangling.

2) Python: A widely used high-level, general-purpose, interpreted, dynamic programming language, with libraries such as pandas, NumPy, statsmodels, scikit-learn, NLTK, and matplotlib to support data analysis and visualization. Functionality supported: exploration, visualization, analysis (qualitative and quantitative), data wrangling.

3) Wrangler: An interactive tool for data cleaning and transformation into data tables, which can be exported to Excel and other tools for analysis. Functionality supported: data wrangling.

4) Apache Drill: An open source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Functionality supported: exploration, data cleaning and transformation, querying, visualization.

5) Weka: A collection of visualization tools and algorithms for data analysis and predictive modeling, with a GUI for easy access. Functionality supported: data preprocessing, visualization, clustering, classification, regression.

6) AQUAD: A content analysis tool that supports search and retrieval. Functionality supported: text analytics, coding.

7) Data Applied: An online tool for data mining and visualization supporting multiple analytical tasks, including time-series forecasting, correlation analysis, decision trees, and clustering. Functionality supported: analysis, visualization.

The Apache Drill Explorer window has two tabs: a Browse tab and a SQL tab. The Browse tab lets the user view any existing metadata for a schema accessible to Drill; the SQL tab allows the user to preview the results of custom queries and save the results as a view. Drill is extensible and can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions, thus avoiding an ETL (extract-transform-load) process to combine data from multiple sources before analysis. The required storage plugins must be added, based on the available data sets, in order to explore such disparate data sources.


Below is a short tutorial on experimenting with the Yelp dataset using Apache Drill. The dataset can be found at Yelp (https://www.yelp.com/dataset_challenge) and is in JSON format. The prefix dfs. used in each of the queries refers to the path where the dataset is saved on a machine; it needs to be modified to include the path to the Yelp dataset on your machine when trying this example.

The first step is to explore the data in those JSON files using Drill. We can use a SQL SELECT to view the contents of a JSON file, restricting the number of rows returned with LIMIT and providing the location of the JSON file in place of the table name of a regular SQL query; Drill can directly query self-describing files such as JSON, Parquet, and text files. We can explore the review dataset further by examining specific columns of the JSON file, and we can use aggregation functions such as SUM in the SQL statements. We can view the attributes in the Yelp business dataset by turning on text mode in Drill, and text mode must be turned off before performing arithmetic operations on the dataset. Text mode is toggled with an ALTER SYSTEM statement:

    alter system set `store.json.all_text_mode` = true;
    alter system set `store.json.all_text_mode` = false;

Business users, analysts, and data scientists use standard BI/analytics tools such as Tableau, QlikView, MicroStrategy, SAS, and Excel to interact with non-relational data stores by leveraging Drill's JDBC and ODBC drivers. Drill's symmetrical architecture and simple installation make it easy to deploy and operate very large clusters. Drill is the world's first and only distributed SQL engine that does not require schemas. All of these features make Apache Drill a highly desirable tool for data analysis.
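To make the walkthrough concrete, here is a hedged sketch of running the kinds of queries described above from Python through Drill's ODBC driver. The DSN name "DrillDSN" and the file path are placeholders you must adapt to your own setup, and the stars column is an assumption based on the Yelp review schema:

    # A sketch of querying the Yelp review file through Apache Drill from Python.
    # Assumptions: the Drill ODBC driver is installed and configured under the
    # hypothetical DSN "DrillDSN", and /path/to/yelp points at the dataset.
    import pyodbc

    conn = pyodbc.connect("DSN=DrillDSN", autocommit=True)
    cur = conn.cursor()

    # Preview a few raw records; LIMIT restricts the number of rows returned.
    cur.execute("SELECT * FROM dfs.`/path/to/yelp/review.json` LIMIT 3")
    for row in cur.fetchall():
        print(row)

    # Aggregate a specific column: count reviews per star rating
    # (SUM, AVG, and other aggregation functions work the same way).
    cur.execute(
        "SELECT stars, COUNT(*) AS reviews "
        "FROM dfs.`/path/to/yelp/review.json` GROUP BY stars ORDER BY stars"
    )
    print(cur.fetchall())

    conn.close()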

Conclusion

There are several open source data analysis software tools that can be leveraged directly to perform data analysis. With the increasing number of open source software tools available, researchers should start using these tools to improve the efficiency of their research and to automate several components of data analysis.

Acknowledgments

I dedicate this chapter to my family & Emroz for their unwavering support and for believing in me. I am grateful to Dr. Alejandra Magana for all the motivation and support.

References

1. Creswell, J. W. Educational Research: Planning, Conducting, and Evaluating Quantitative and Qualitative Research; Prentice Hall: Upper Saddle River, NJ, 2002; Vol. 2, pp 24−25.
2. Apache Drill: Schema-free SQL for Hadoop, NoSQL and Cloud Storage. https://drill.apache.org/ (accessed Sep 8, 2017).
