
My Fandom Wikis to Obsidian Vaults Converter (Python Code Included)

ahmadmanga

The other day, I had the genius idea of creating fanfiction using AI. I had a few anime series in mind, but it didn't really matter which one; it's a "just for fun" idea. For that, I needed a lot of data on the characters and setting, plus an AI model that is good at creative writing and has a long context length.

For the data part, I decided to collect pages from Wikia/Fandom.com. Since feeding whole pages to the model would waste a lot of tokens, I wanted a fast way to crawl the website and keep only the data I wanted.


Method:

Ideally, I want to be able to create a text file containing only the relevant data for the fanfiction I'm creating... So, here's the method I arrived at:

  • Crawl the whole Wiki for the series I'm looking at. (Example: Dragon Ball)
  • Convert the crawled pages into Markdown files suitable for the Obsidian notes app
  • Use Obsidian's embeddings and ChatGPT plugins to create an even more focused .txt file containing only the info I need to pass to the AI
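
The third step relies on Obsidian plugins, so there is no script for it in this post. As a rough stand-in, here is a minimal sketch that assumes the vault is just a folder of .md files and that plain keyword matching is good enough. That is a swapped-in technique, not what the embeddings plugin does, and the function name, parameters, and the `max_chars` budget are all hypothetical:

```python
from pathlib import Path


def build_context_file(vault_dir, keywords, output_file, max_chars=20000):
    """Concatenate notes mentioning any keyword into one text file.

    Hypothetical helper: keyword matching stands in for the embeddings
    plugin, and max_chars is an arbitrary budget, not a model limit.
    """
    lowered = [k.lower() for k in keywords]
    parts = []
    used = 0
    for note in sorted(Path(vault_dir).rglob("*.md")):
        text = note.read_text(encoding="utf-8")
        if not any(k in text.lower() for k in lowered):
            continue  # note does not mention any keyword
        block = f"# {note.stem}\n\n{text.strip()}\n"
        if used + len(block) > max_chars:
            break  # stop once the character budget would be exceeded
        parts.append(block)
        used += len(block)
    Path(output_file).write_text("\n".join(parts), encoding="utf-8")
    return len(parts)
```

Something like `build_context_file("obsidian_vault", ["Goku", "Saiyan"], "context.txt")` would then write one file with only the matching notes, each under a heading taken from its filename.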

For the crawling part, I used Crawl4AI's Docker app, which needs Python code to call it. I used AI to write code suited to crawling Fandom.com websites and removing redundant parts of each page. The result is a batch of Markdown files that I put in my Obsidian vault for further refining...
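
Before any crawling, it helps to know whether the Docker container is actually reachable. The crawler script in this post checks `http://localhost:11235/health` for an HTTP 200, and the sketch below does the same thing with only the standard library (the script itself uses `requests`). The function name and the `opener` parameter are hypothetical, added so the check can be tried without a running container:

```python
import urllib.request


def crawl4ai_is_up(base_url="http://localhost:11235", timeout=10,
                   opener=urllib.request.urlopen):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False
```

If this returns False, the container is probably not running or is listening on a different port than the default assumed here.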




Prerequisites

The Code

So, here's my current code. It works but is kind of clunky, and since I vibe-coded it using AI, I'm not sure how all of it works (I only understand the basics), so further editing would be a nightmare...

Still, I decided to share it, just in case anyone is interested:

Disclaimer: The code was created with the Qwen3-235B model over one hour of conversation. The result is the two files below:

bulk_fandom_crawler.py:

The code below is called via terminal: python3 bulk_fandom_crawler.py WEBSITEURL --max-crawl PAGESCOUNT

import csv 
import os 
import re 
import requests 
from urllib.parse import urlparse, urljoin 
from bs4 import BeautifulSoup 
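# Crawl4AiCrawler comes from a separate local module, crawl_to_mkdn.py, which is not listed in this post
from crawl_to_mkdn import Crawl4AiCrawler 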
import time 
import logging 
from typing import Set, List, Tuple 
 
class BulkFandomCrawler: 
    def __init__(self, base_url: str = "http://localhost:11235"): 
        self.crawler = Crawl4AiCrawler(base_url) 
        self.links_csv = "fandom_links.csv" 
        self.finished_csv = "finished_crawled.csv" 
        self.output_root = "crawled_sites" 
        self.log_file = "crawler_logs.txt" 
         
        # Initialize logging 
        self._setup_logging() 
         
        # Initialize CSV files 
        self._init_csv_files() 
     
    def _setup_logging(self): 
        """Setup logging to file and console.""" 
        logging.basicConfig( 
            level=logging.INFO, 
            format='%(asctime)s - %(levelname)s - %(message)s', 
            handlers=[ 
                logging.FileHandler(self.log_file, encoding='utf-8'), 
                logging.StreamHandler() 
            ] 
        ) 
        self.logger = logging.getLogger(__name__) 
     
    def _init_csv_files(self): 
        """Initialize CSV files with headers if they don't exist.""" 
        # Initialize links CSV 
        if not os.path.exists(self.links_csv): 
            with open(self.links_csv, 'w', newline='', encoding='utf-8') as f: 
                writer = csv.writer(f) 
                writer.writerow(['URL', 'Topic_Title', 'Subdomain', 'Status']) 
            self.logger.info(f"📄 Created new links CSV: {self.links_csv}") 
         
        # Initialize finished CSV 
        if not os.path.exists(self.finished_csv): 
            with open(self.finished_csv, 'w', newline='', encoding='utf-8') as f: 
                writer = csv.writer(f) 
                writer.writerow(['URL', 'Topic_Title', 'Subdomain', 'Crawl_Date', 'Success', 'Word_Count', 'File_Size']) 
            self.logger.info(f"📄 Created new finished CSV: {self.finished_csv}") 
     
    def _get_finished_urls(self) -> Set[str]: 
        """Get set of already crawled URLs from finished CSV.""" 
        finished_urls = set() 
        if os.path.exists(self.finished_csv): 
            with open(self.finished_csv, 'r', encoding='utf-8') as f: 
                reader = csv.DictReader(f) 
                for row in reader: 
                    finished_urls.add(row['URL']) 
        return finished_urls 
     
    def _extract_title_from_url(self, url: str) -> str: 
        """Extract topic title from fandom URL.""" 
        parsed = urlparse(url) 
        path_parts = parsed.path.strip('/').split('/') 
        if 'wiki' in path_parts: 
            wiki_index = path_parts.index('wiki') 
            if wiki_index + 1 < len(path_parts): 
                title = path_parts[wiki_index + 1] 
                # Replace underscores with spaces (percent-encoding is left as-is) 
                title = title.replace('_', ' ') 
                return title 
        return "Unknown" 
     
    def _get_subdomain(self, url: str) -> str: 
        """Extract subdomain from fandom URL.""" 
        parsed = urlparse(url) 
        domain_parts = parsed.netloc.split('.') 
        if len(domain_parts) >= 3 and 'fandom.com' in parsed.netloc: 
            return domain_parts[0] 
        return "unknown" 
     
    def discover_links(self, start_url: str) -> List[Tuple[str, str, str]]: 
        """ 
        Discover fandom links from a starting URL. 
        Returns list of (url, title, subdomain) tuples. 
        """ 
        self.logger.info(f"🔍 Discovering links from: {start_url}") 
         
        try: 
            # Use requests to get the page content for link discovery 
            response = requests.get(start_url, timeout=30) 
            response.raise_for_status() 
            soup = BeautifulSoup(response.content, 'html.parser') 
             
            # Extract subdomain from start URL 
            start_subdomain = self._get_subdomain(start_url) 
             
            # Find all links 
            links = [] 
            for link in soup.find_all('a', href=True): 
                href = link['href'] 
                 
                # Convert relative URLs to absolute 
                if href.startswith('/'): 
                    href = urljoin(start_url, href) 
                 
                # Check if it's a fandom wiki link from same subdomain 
                if self._is_valid_fandom_link(href, start_subdomain): 
                    title = self._extract_title_from_url(href) 
                    subdomain = self._get_subdomain(href) 
                    links.append((href, title, subdomain)) 
             
            # Remove duplicates 
            unique_links = list(set(links)) 
            self.logger.info(f"✅ Found {len(unique_links)} unique fandom links") 
            return unique_links 
             
        except Exception as e: 
            self.logger.error(f"❌ Error discovering links from {start_url}: {str(e)}") 
            return [] 
     
    def _is_valid_fandom_link(self, url: str, target_subdomain: str) -> bool: 
        """Check if URL is a valid fandom wiki link from the target subdomain.""" 
        try: 
            parsed = urlparse(url) 
             
            # Must be fandom.com domain 
            if 'fandom.com' not in parsed.netloc: 
                return False 
             
            # Must be from same subdomain 
            if self._get_subdomain(url) != target_subdomain: 
                return False 
             
            # Must be a wiki page 
            if '/wiki/' not in parsed.path: 
                return False 
             
            # Exclude certain pages 
            excluded_patterns = [ 
                'Special:', 'File:', 'Category:', 'Template:', 'User:', 'Talk:', 
                'action=', 'oldid=', '#', '?' 
            ] 
             
            for pattern in excluded_patterns: 
                if pattern in url: 
                    return False 
             
            return True 
             
        except Exception: 
            return False 
     
    def add_links_to_csv(self, links: List[Tuple[str, str, str]]): 
        """Add discovered links to the main links CSV.""" 
        existing_urls = set() 
         
        # Read existing URLs 
        if os.path.exists(self.links_csv): 
            with open(self.links_csv, 'r', encoding='utf-8') as f: 
                reader = csv.DictReader(f) 
                for row in reader: 
                    existing_urls.add(row['URL']) 
         
        # Add new links 
        new_links = 0 
        with open(self.links_csv, 'a', newline='', encoding='utf-8') as f: 
            writer = csv.writer(f) 
            for url, title, subdomain in links: 
                if url not in existing_urls: 
                    writer.writerow([url, title, subdomain, 'pending']) 
                    new_links += 1 
         
        self.logger.info(f"➕ Added {new_links} new links to {self.links_csv}") 
     
    def get_pending_urls(self) -> List[Tuple[str, str, str]]: 
        """Get URLs that haven't been crawled yet.""" 
        finished_urls = self._get_finished_urls() 
        pending_urls = [] 
         
        if os.path.exists(self.links_csv): 
            with open(self.links_csv, 'r', encoding='utf-8') as f: 
                reader = csv.DictReader(f) 
                for row in reader: 
                    if row['URL'] not in finished_urls and row['Status'] == 'pending': 
                        pending_urls.append((row['URL'], row['Topic_Title'], row['Subdomain'])) 
         
        return pending_urls 
     
    def crawl_url(self, url: str, title: str, subdomain: str) -> bool: 
        """Crawl a single URL and record the result.""" 
        try: 
            self.logger.info(f"🕷️ Crawling: {url}") 
            self.crawler.crawl_and_save(url, self.output_root) 
             
            # Calculate stats for the crawled content 
            word_count, file_size = self._get_content_stats(url) 
             
            # Record success in finished CSV 
            with open(self.finished_csv, 'a', newline='', encoding='utf-8') as f: 
                writer = csv.writer(f) 
                writer.writerow([url, title, subdomain, time.strftime('%Y-%m-%d %H:%M:%S'), 'True', word_count, file_size]) 
             
            self.logger.info(f"✅ Successfully crawled: {title} ({word_count} words, {file_size} bytes)") 
            return True 
             
        except Exception as e: 
            self.logger.error(f"❌ Failed to crawl {url}: {str(e)}") 
             
            # Record failure in finished CSV 
            with open(self.finished_csv, 'a', newline='', encoding='utf-8') as f: 
                writer = csv.writer(f) 
                writer.writerow([url, title, subdomain, time.strftime('%Y-%m-%d %H:%M:%S'), 'False', 0, 0]) 
             
            return False 
     
    def _get_content_stats(self, url: str) -> Tuple[int, int]: 
        """Get word count and file size for crawled content.""" 
        try: 
            # Parse URL to find the output file 
            parsed = urlparse(url) 
            site_dir = parsed.netloc 
            path = parsed.path.strip("/") 
             
            full_path_dir = os.path.join(self.output_root, site_dir, path) 
            filename = "index.txt" if not path else f"{path.split('/')[-1]}.txt" 
            output_file = os.path.join(full_path_dir, filename) 
             
            if os.path.exists(output_file): 
                file_size = os.path.getsize(output_file) 
                with open(output_file, 'r', encoding='utf-8') as f: 
                    content = f.read() 
                    word_count = len(content.split()) 
                return word_count, file_size 
             
        except Exception as e: 
            self.logger.warning(f"⚠️ Could not calculate stats for {url}: {str(e)}") 
         
        return 0, 0 
     
    def bulk_crawl(self, max_urls: int = None, delay: float = 1.0): 
        """Perform bulk crawling of pending URLs.""" 
        pending_urls = self.get_pending_urls() 
         
        if not pending_urls: 
            self.logger.info("📭 No pending URLs to crawl") 
            return 
         
        if max_urls: 
            pending_urls = pending_urls[:max_urls] 
         
        self.logger.info(f"🚀 Starting bulk crawl of {len(pending_urls)} URLs") 
         
        success_count = 0 
        total_words = 0 
        total_size = 0 
         
        for i, (url, title, subdomain) in enumerate(pending_urls, 1): 
            self.logger.info(f"\n📄 [{i}/{len(pending_urls)}] Processing: {title}") 
             
            if self.crawl_url(url, title, subdomain): 
                success_count += 1 
                # Get stats for this crawl 
                word_count, file_size = self._get_content_stats(url) 
                total_words += word_count 
                total_size += file_size 
             
            # Add delay between requests 
            if delay > 0 and i < len(pending_urls): 
                time.sleep(delay) 
         
        self.logger.info("\n🎉 Bulk crawl completed!") 
        self.logger.info(f"📊 Results: {success_count}/{len(pending_urls)} successful") 
        self.logger.info(f"📝 Total words: {total_words:,}") 
        self.logger.info(f"💾 Total size: {total_size:,} bytes") 
     
    def start_crawling(self, start_url: str, discover_new: bool = True, max_crawl: int = None): 
        """Main method to start the crawling process.""" 
        self.logger.info("🚀 Starting Bulk Fandom Crawler") 
        self.logger.info(f"🌐 Start URL: {start_url}") 
         
        # Show current status 
        self._show_status() 
         
        # Validate start URL 
        if not self._is_valid_fandom_link(start_url, self._get_subdomain(start_url)): 
            self.logger.error("❌ Invalid fandom URL provided") 
            return 
         
        # Check Crawl4AI health 
        try: 
            health = requests.get("http://localhost:11235/health", timeout=10) 
            if health.status_code != 200: 
                self.logger.error("❌ Crawl4AI service not healthy") 
                return 
            self.logger.info("✅ Crawl4AI health check passed") 
        except requests.exceptions.RequestException: 
            self.logger.error("❌ Could not connect to Crawl4AI. Please start the Docker container:") 
            self.logger.error("   docker run -p 11235:11235 ghcr.io/unclecode/crawl4ai:latest") 
            return 
         
        # Discover new links if requested 
        if discover_new: 
            links = self.discover_links(start_url) 
            if links: 
                self.add_links_to_csv(links) 
         
        # Show updated status 
        self._show_status() 
         
        # Start bulk crawling 
        self.bulk_crawl(max_urls=max_crawl) 
     
    def _show_status(self): 
        """Display current crawler status.""" 
        pending_count = len(self.get_pending_urls()) 
        finished_count = len(self._get_finished_urls()) 
         
        self.logger.info("📊 Current Status:") 
        self.logger.info(f"   📋 Pending URLs: {pending_count}") 
        self.logger.info(f"   ✅ Finished URLs: {finished_count}") 
        self.logger.info(f"   📁 Output directory: {self.output_root}") 
        self.logger.info(f"   📄 Links CSV: {self.links_csv}") 
        self.logger.info(f"   📄 Finished CSV: {self.finished_csv}") 
        self.logger.info(f"   📝 Log file: {self.log_file}") 
 
def main(): 
    import sys 
     
    if len(sys.argv) < 2: 
        print("Usage: python bulk_fandom_crawler.py <START_URL> [--no-discover] [--max-crawl N]") 
        print("Example: python bulk_fandom_crawler.py https://mushokutensei.fandom.com/wiki/Roxy_Migurdia") 
        print("Options:") 
        print("  --no-discover: Skip link discovery, only crawl existing pending URLs") 
        print("  --max-crawl N: Limit crawling to N URLs") 
        sys.exit(1) 
     
    start_url = sys.argv[1] 
    discover_new = '--no-discover' not in sys.argv 
    max_crawl = None 
     
    # Parse max-crawl option 
    if '--max-crawl' in sys.argv: 
        try: 
            max_idx = sys.argv.index('--max-crawl') 
            # A missing value raises IndexError instead of being silently ignored 
            max_crawl = int(sys.argv[max_idx + 1]) 
        except (ValueError, IndexError): 
            print("❌ Invalid --max-crawl value") 
            sys.exit(1) 
     
    # Start crawling 
    crawler = BulkFandomCrawler() 
    crawler.start_crawling(start_url, discover_new=discover_new, max_crawl=max_crawl) 
 
if __name__ == "__main__": 
    main() 
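
One thing worth knowing about the script above: `discover_links` only collects links from the start page, so pages that are two or more clicks away are never queued. A minimal sketch of how deeper discovery could work is a breadth-first walk. Here `discover_breadth_first`, `get_links`, and `max_pages` are hypothetical names, and the link-fetching function is passed in so the traversal logic stays separate from the network code:

```python
from collections import deque


def discover_breadth_first(start_url, get_links, max_pages=100):
    """Walk a wiki breadth-first and return page URLs in visit order.

    get_links is a caller-supplied function (hypothetical here) that
    returns the valid wiki links found on one page.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order
```

With the class above, `get_links` could be a thin wrapper such as `lambda u: [link[0] for link in crawler.discover_links(u)]`, since `discover_links` returns `(url, title, subdomain)` tuples. A real run would also want a delay between requests, like the one in `bulk_crawl`.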

crawl2obsidian.py:

#!/usr/bin/env python3 
""" 
Convert crawled Fandom pages to clean Obsidian markdown vault. 
Keeps images but strips all URL links while preserving text content. 
""" 
 
import os 
import re 
from pathlib import Path 
 
def clean_fandom_content(content): 
    """Clean fandom content by removing links but keeping images and text.""" 
     
    # Keep images but remove their URLs - convert ![alt](url) to ![alt] 
    content = re.sub(r'!\[([^\]]*)\]\([^)]+\)', r'![\1]', content) 
     
    # Remove fandom links but keep the display text 
    # Pattern: [text](https://mushokutensei.fandom.com/wiki/Page "Page") 
    content = re.sub(r'\[([^\]]+)\]\(https://[^.]+\.fandom\.com/[^)]+\)', r'\1', content) 
     
    # Remove internal navigation links like [1 Receptionist](url#section) but keep text 
    content = re.sub(r'\[[\d\.]+ ([^\]]+)\]\([^)]+#[^)]+\)', r'\1', content) 
     
    # Remove edit links [[](auth.fandom.com...)] 
    content = re.sub(r'\[\[\]\([^)]+\)\]', '', content) 
     
    # Remove citation links like [[1]](url) but keep the number 
    content = re.sub(r'\[\[(\d+)\]\]\([^)]+\)', r'[\1]', content) 
     
    # Remove "Jump up" reference links 
    content = re.sub(r'\[↑\]\([^)]+\s+"[^"]*"\)', '', content) 
    content = re.sub(r'↑ \[Jump up to: [^\]]+\]', '', content) 
     
    # Remove remaining markdown links but keep text 
    content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content) 
     
    # Remove HTML-like tags 
    content = re.sub(r'<[^>]+>', '', content) 
     
    # Remove table separators 
    content = re.sub(r'^---\s*$', '', content, flags=re.MULTILINE) 
     
    # Remove "Sign in to edit" text 
    content = re.sub(r'"Sign in to edit"', '', content) 
     
    # Remove lines that are just numbers (like "1/2", "1/3") 
    content = re.sub(r'^\d+/\d+\s*$', '', content, flags=re.MULTILINE) 
     
    # Remove "Voiced by:" lines with complex formatting 
    content = re.sub(r'Voiced by:.*?(?=\n##|\n\n|\Z)', '', content, flags=re.DOTALL) 
     
    # Clean up navigation sections at the end 
    content = re.sub(r'## Navigation.*?$', '', content, flags=re.DOTALL) 
     
    # Remove expand sections 
    content = re.sub(r'Expand\[.*?\].*?(?=\n##|\n\n|\Z)', '', content, flags=re.DOTALL) 
     
    # Clean up multiple empty lines 
    content = re.sub(r'\n\s*\n\s*\n+', '\n\n', content) 
     
    # Remove empty lines at start and end 
    content = content.strip() 
     
    return content 
 
def is_empty_content(content): 
    """Check if content is empty or only contains '(No content extracted)'.""" 
    cleaned = content.strip() 
    return ( 
        not cleaned or 
        cleaned == "(No content extracted)" or 
        len(cleaned) < 10  # Very short content is likely meaningless 
    ) 
 
def remove_empty_directories(directory): 
    """Recursively remove empty directories.""" 
    removed_count = 0 
     
    # Walk bottom-up to remove empty directories 
    for root, dirs, files in os.walk(directory, topdown=False): 
        for dir_name in dirs: 
            dir_path = os.path.join(root, dir_name) 
            try: 
                # Try to remove if empty 
                os.rmdir(dir_path) 
                removed_count += 1 
                print(f"Removed empty directory: {dir_path}") 
            except OSError: 
                # Directory not empty, skip 
                pass 
     
    return removed_count 
 
def find_page_groups(input_dir): 
    """Find groups of files that should be consolidated into single pages.""" 
    page_groups = {} 
    all_files = list(input_dir.rglob('*.txt')) 
     
    # First, identify potential main pages and their subpages 
    for txt_file in all_files: 
        relative_path = txt_file.relative_to(input_dir) 
        path_parts = relative_path.parts 
         
        # Look for main page files (at directory level, not in subdirectories) 
        if len(path_parts) >= 2: 
            parent_dir = relative_path.parent 
            filename = relative_path.stem 
             
            # Check if there are subdirectories with the same base name as this file 
            potential_subpage_dirs = [] 
            for other_file in all_files: 
                other_relative = other_file.relative_to(input_dir) 
                other_parts = other_relative.parts 
                 
                # Check if this other file is in a subdirectory of the same parent 
                # and the subdirectory name could be a subpage of our main file 
                if (len(other_parts) >= 3 and 
                    other_relative.parent.parent == parent_dir and 
                    other_relative.stem == other_parts[-2]):  # filename matches its directory 
                     
                    # Check if this could be a subpage of our main file 
                    subpage_dir = other_parts[-2] 
                    if subpage_dir.lower() in ['appearance', 'relationships', 'story', 'chronology', 
                                             'gallery', 'powers_and_abilities', 'future']: 
                        potential_subpage_dirs.append(other_file) 
             
            # If we found potential subpages, create a group 
            if potential_subpage_dirs: 
                group_key = f"{parent_dir}/{filename}" 
                if group_key not in page_groups: 
                    page_groups[group_key] = {'main': txt_file, 'subpages': potential_subpage_dirs} 
     
    # Find standalone files (not part of any group) 
    grouped_files = set() 
    for group in page_groups.values(): 
        grouped_files.add(group['main']) 
        grouped_files.update(group['subpages']) 
     
    standalone_files = [f for f in all_files if f not in grouped_files] 
     
    return page_groups, standalone_files 
 
def process_page_group(group, input_dir, output_dir): 
    """Process a group of related files into a single markdown file.""" 
    main_file = group['main'] 
    subpages = sorted(group['subpages'], key=lambda x: x.name) 
     
    combined_content = [] 
     
    # Process main file first 
    try: 
        with open(main_file, 'r', encoding='utf-8') as f: 
            main_content = f.read() 
         
        if not is_empty_content(main_content): 
            cleaned_main = clean_fandom_content(main_content) 
            if not is_empty_content(cleaned_main): 
                combined_content.append(cleaned_main) 
    except Exception as e: 
        print(f"Error reading main file {main_file}: {e}") 
     
    # Process subpages 
    for subpage in subpages: 
        try: 
            with open(subpage, 'r', encoding='utf-8') as f: 
                content = f.read() 
             
            if not is_empty_content(content): 
                cleaned_content = clean_fandom_content(content) 
                if not is_empty_content(cleaned_content): 
                    # Use the subpage filename as section header 
                    section_name = subpage.stem.replace('_', ' ') 
                    combined_content.append(f"\n\n# {section_name}\n\n{cleaned_content}") 
        except Exception as e: 
            print(f"Error reading subpage {subpage}: {e}") 
     
    if combined_content: 
        # Create output path 
        relative_path = main_file.relative_to(input_dir) 
        filename_with_spaces = relative_path.stem.replace('_', ' ') 
        output_path = relative_path.parent / f"{filename_with_spaces}.md" 
        output_file = output_dir / output_path 
         
        # Create output directory if needed 
        output_file.parent.mkdir(parents=True, exist_ok=True) 
         
        # Write combined content 
        final_content = '\n\n'.join(combined_content) 
        with open(output_file, 'w', encoding='utf-8') as f: 
            f.write(final_content) 
         
        return True, len(group['subpages']) + 1  # +1 for main file 
     
    return False, 0 
 
def process_crawled_files(): 
    """Process all crawled .txt files and convert to clean markdown.""" 
     
    input_dir = Path('crawled_sites') 
    output_dir = Path('obsidian_vault') 
     
    # Create output directory 
    output_dir.mkdir(exist_ok=True) 
     
    processed_count = 0 
    skipped_count = 0 
    consolidated_count = 0 
     
    # Find page groups for consolidation 
    print("Analyzing file structure for consolidation...") 
    page_groups, standalone_files = find_page_groups(input_dir) 
     
    print(f"Found {len(page_groups)} page groups to consolidate") 
    print(f"Found {len(standalone_files)} standalone files") 
     
    # Process consolidated page groups 
    for group_key, group in page_groups.items(): 
        try: 
            success, file_count = process_page_group(group, input_dir, output_dir) 
            if success: 
                processed_count += 1 
                consolidated_count += file_count 
                print(f"Consolidated: {group['main'].relative_to(input_dir)} (+{len(group['subpages'])} subpages)") 
            else: 
                skipped_count += file_count 
        except Exception as e: 
            print(f"Error processing group {group_key}: {e}") 
            skipped_count += len(group['subpages']) + 1 
     
    # Process standalone files 
    for txt_file in standalone_files: 
        try: 
            # Read original content 
            with open(txt_file, 'r', encoding='utf-8') as f: 
                content = f.read() 
             
            # Skip if content is empty or meaningless 
            if is_empty_content(content): 
                print(f"Skipped empty file: {txt_file.relative_to(input_dir)}") 
                skipped_count += 1 
                continue 
             
            # Clean the content 
            cleaned_content = clean_fandom_content(content) 
             
            # Double-check if cleaned content is now empty 
            if is_empty_content(cleaned_content): 
                print(f"Skipped file with no useful content after cleaning: {txt_file.relative_to(input_dir)}") 
                skipped_count += 1 
                continue 
             
            # Create output path maintaining directory structure 
            relative_path = txt_file.relative_to(input_dir) 
             
            # Convert underscores to spaces in filename for easier linking 
            filename_with_spaces = relative_path.stem.replace('_', ' ') 
            output_path = relative_path.parent / f"{filename_with_spaces}.md" 
            output_file = output_dir / output_path 
             
            # Create output directory if needed 
            output_file.parent.mkdir(parents=True, exist_ok=True) 
             
            # Write cleaned content 
            with open(output_file, 'w', encoding='utf-8') as f: 
                f.write(cleaned_content) 
             
            processed_count += 1 
            print(f"Processed: {relative_path}") 
             
        except Exception as e: 
            print(f"Error processing {txt_file}: {e}") 
     
    # Remove empty directories 
    print("\nRemoving empty directories...") 
    removed_dirs = remove_empty_directories(output_dir) 
     
    print(f"\nCompleted!") 
    print(f"Processed: {processed_count} files") 
    print(f"Consolidated: {consolidated_count} files into page groups") 
    print(f"Skipped: {skipped_count} empty files") 
    print(f"Removed: {removed_dirs} empty directories") 
    print(f"Output directory: {output_dir.absolute()}") 
 
if __name__ == "__main__": 
    process_crawled_files() 
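
To see what the cleaning step does to a single line, here is a small standalone sketch that applies three of the substitutions from `clean_fandom_content`, in the same order, to a made-up sample string. The `strip_links` name and the sample text are illustrative only; the regular expressions are copied from the script above:

```python
import re


def strip_links(text):
    """Apply three of the substitutions from clean_fandom_content."""
    # ![alt](url) -> ![alt]
    text = re.sub(r'!\[([^\]]*)\]\([^)]+\)', r'![\1]', text)
    # [text](https://<wiki>.fandom.com/...) -> text
    text = re.sub(r'\[([^\]]+)\]\(https://[^.]+\.fandom\.com/[^)]+\)', r'\1', text)
    # [[1]](url) -> [1]
    text = re.sub(r'\[\[(\d+)\]\]\([^)]+\)', r'[\1]', text)
    return text


sample = (
    'Roxy is a [mage](https://mushokutensei.fandom.com/wiki/Mage "Mage") '
    '![Roxy](https://static.example/roxy.png) cited'
    '[[1]](https://mushokutensei.fandom.com/wiki/Roxy#cite_note-1).'
)
print(strip_links(sample))
# prints: Roxy is a mage ![Roxy] cited[1].
```

The order matters: images are handled first so that the later link patterns do not strip the `![alt]` marker.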
