A Learning-Based Method for Estimating and Evaluating the Quality of Linked Datasets
Subject area: Information and Communication Technology
1 - Faculty member
Keywords: data quality, automatic assessment, linked data, learning models
Abstract:
The main purpose of linked data is to realize the Semantic Web and to extract knowledge by linking the data available on the web. One of the obstacles to achieving this goal is the presence of problems and errors in published data, which leads to incorrect links and, consequently, invalid inferences. Since data quality has a direct impact on the success of the linked data project and the realization of the Semantic Web, the quality of each dataset should be assessed in the early stages of publication. In this paper, a learning-based method for evaluating linked datasets is presented. To this end, a base quality model is first selected, and its quality attributes are mapped to the domain under study (in this paper, the domain of linked data). Then, based on this mapping, the important quality attributes in the domain are identified and described precisely by defining sub-attributes. In the third stage, drawing on previous studies, measurement metrics for each sub-attribute are extracted or defined. These metrics must then be implemented according to the type of data in the studied domain. In the next step, several datasets are selected and the metric values are computed automatically on them. To apply supervised learning methods, the quality of the data must also be assessed empirically by experts. At this stage, the accuracy of each dataset is evaluated by experts, and correlation tests are used to examine the relationship between the quantitative values of the proposed metrics and the assessed accuracy of the data. Then, using learning methods, the metrics that are effective in accuracy assessment and offer acceptable predictive power are identified. Finally, a quality prediction model based on the proposed metrics is built using learning methods.
The evaluation results showed that, in addition to being automatic, the proposed method is scalable, efficient, and practically applicable.
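The correlation study and the supervised prediction step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the metric names, metric values, and expert accuracy scores below are hypothetical, and ordinary Pearson correlation with a linear regression model stands in for whichever correlation tests and learning methods the paper employs.

```python
# Minimal sketch of the correlation study and the supervised prediction step.
# All metric names, metric values, and expert scores here are hypothetical.
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# Metric values computed automatically on five test datasets
# (rows = datasets, columns = proposed quality metrics).
metric_names = ["syntactic_validity", "inconsistency_ratio", "link_precision"]
metric_values = [
    [0.92, 0.10, 0.85],
    [0.75, 0.30, 0.60],
    [0.60, 0.55, 0.40],
    [0.85, 0.20, 0.70],
    [0.50, 0.70, 0.30],
]

# Accuracy of each dataset as judged empirically by domain experts.
expert_accuracy = [0.90, 0.70, 0.55, 0.80, 0.45]

# Correlation study: keep the metrics whose values correlate
# significantly with the expert-assessed accuracy.
effective = []
for j, name in enumerate(metric_names):
    column = [row[j] for row in metric_values]
    r, p = pearsonr(column, expert_accuracy)
    if p < 0.05:
        effective.append((j, name, r))

# Supervised learning: fit a prediction model on the effective metrics only.
X = [[row[j] for j, _, _ in effective] for row in metric_values]
model = LinearRegression().fit(X, expert_accuracy)

# Predict the accuracy of an unseen dataset from its automatically
# computed metric values (again, hypothetical numbers).
new_values = {"syntactic_validity": 0.80,
              "inconsistency_ratio": 0.25,
              "link_precision": 0.65}
predicted = model.predict([[new_values[name] for _, name, _ in effective]])
```

In practice, the metric values would come from the automated measurement tools, the expert scores from the empirical evaluation phase, and the fitted model would be validated on held-out datasets before being used to predict the quality of newly published ones.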